The Problem of Anomalous Data

A Transformational Approach


David J. Birnbaum

University of Pittsburgh
Department of Slavic Languages and Literatures
Telephone: 1 412 624 5712
Fax: 1 412 624 9714
Email: djb@clover.slavic.pitt.edu
URI: http://clover.slavic.pitt.edu/~djb/

David J. Birnbaum is Associate Professor and Chair of the Department of Slavic Languages and Literatures at the University of Pittsburgh. His research in electronic text technology is concentrated primarily on problems of encoding and processing medieval Slavic manuscript materials.

David A. Mundie

DASYS Inc.
Telephone: 1 412 321 4346
Email: mundie@dasyseda.com
URI: http://www.telerama.com/~mundie/

David A. Mundie is head of Software Quality Assurance at DASYS Incorporated. He has a long history of dabbling in the computer-assisted analysis of literary texts.


Abstract

In the context of encoding new texts, the use of automated tools to ensure conformance to a grammar specified in a DTD has many benefits, most notably the guarantee of correctness and the simplification of downstream processing applications. In the context of producing new electronic editions of existing print documents, however, the use of those tools is problematic, because the existing paper texts may violate their underlying structure due to human error during their compilation or production. The producer of such a document is torn between the need to preserve the historical record of the original text, on the one hand, and the need to produce a correct, validated document that conforms to a DTD on the other. Such texts raise the philosophical problem of encoding what is in essence an invalid document within a framework designed specifically to support validity.

The solution that we propose is to view the valid and the invalid texts as transforms of each other. By capturing the transformation rules between them, we can easily produce a correct SGML document instance while at the same time preserving the historical record.

(To appear in Markup Languages: Theory and Practice.)


Background

SGML (Standard Generalized Markup Language, ISO 8879:1986) was developed primarily for encoding new texts, an environment in which the rigorous adherence to a DTD (Document Type Definition) ensures that the resulting documents will exhibit a coherent structure. Consider, for example, the OED (Oxford English Dictionary), which is divided into, among other things, lexical entries, each of which consists of the following information (described in the introduction to the dictionary itself):

  1. identification
    1. main form (required)
    2. pronunciation (required; within parentheses)
    3. grammatical designation (optional; omission means substantive)
    4. specification (optional; e.g. music, biology)
    5. status (optional; e.g., obsolete, archaic, colloquial, dialect)
    6. earlier forms (optional)
    7. inflections (optional)
  2. variants
  3. morphology
    1. etymology (required)
    2. subsequent history (optional)
    3. miscellaneous comments (optional)
  4. signification (required; meaning, subdivided into multiple hierarchies that trace the development of different meanings, with illustrative quotations for each)

The following entry for "mavourneen" is a simplified, partial illustration:

mavourneen (m@'vU@ni:n). Also 9 mavournin. [Irish mo mhurnín.] My darling.

One way to represent the structure of dictionary entries of this type (partially, and with considerable simplification) would be the following (assuming that the content of all undefined elements is #PCDATA):[1]

<!ELEMENT entry  - - (ident, variants?, morph, signif+) >
<!ELEMENT ident  - - (main, pron, gram?, spec?, status?, 
  early?, inflect?) >
<!ELEMENT morph  - - (etym, hist?, misc?) >
<!ELEMENT signif - - (mean, cites+) >
<!ELEMENT cites  - - (cite+) >
<!ELEMENT cite   - - (date, source, text) >
<!ELEMENT source - - (who?, where) >
<!ELEMENT where  - - (title, page?) >

The text of the dictionary excerpt for "mavourneen" cited above might be marked up in conformity with this DTD fragment as follows:[2]

<entry>
  <ident>
    <main>mavourneen</main>
    <pron>m@'vU@ni:n</pron>
  </ident>
  <variant>mavournin</variant>
  <morph>
    <etym>Irish mo mhurn&iacute;n</etym>
  </morph>
  <signif>
    <mean>My darling.</mean>
  </signif>
</entry>

Users who create new dictionaries based on this DTD will be required by their SGML editing and validating tools to follow the specified structure. The model provides some flexibility, so that, for example, authors may include or omit an indication of the lexical status of a particular entry. But this flexibility is restricted; for example, if a <status> element is included, it must follow, rather than precede, the obligatory <pron> element. SGML software is not, of course, able to ensure that the author actually enters status information into the <status> element, but the software can, at least, verify that the <status> element itself occurs only in a legal context. An SGML development environment thus protects users from inadvertently creating syntactically contradictory documents.
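The ordering constraint just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how an SGML parser is actually built: the helper names `compile_model` and `conforms` are hypothetical, and the sketch handles only the "," connector with the ?, + and * occurrence indicators.

```python
import re

# Hypothetical helper: compile an ordered SGML content model such as
# "(ident, variants?, morph, signif+)" into a regular expression over
# space-terminated child-element names. Only the "," connector and the
# ?, + and * occurrence indicators are handled.
def compile_model(model):
    parts = []
    for token in model.strip("() ").split(","):
        token = token.strip()
        name = token.rstrip("?+*")
        indicator = token[len(name):]
        parts.append("(?:%s )%s" % (re.escape(name), indicator))
    return re.compile("".join(parts))

def conforms(model, children):
    # Join the child names into one string and match it against the model.
    subject = "".join(child + " " for child in children)
    return compile_model(model).fullmatch(subject) is not None

IDENT = "(main, pron, gram?, spec?, status?, early?, inflect?)"
print(conforms(IDENT, ["main", "pron", "status"]))  # True: status follows pron
print(conforms(IDENT, ["main", "status", "pron"]))  # False: status precedes pron
```

The check looks only at the names and order of child elements; it never inspects the character data inside them, which mirrors the division of labor described above.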

The Problem

In a standard SGML authoring model, SGML tools can ensure that newly-created documents conform to a DTD developed for a particular purpose. This model is appropriate in an environment where SGML tools are used to create new structured documents, but it is less well suited to the production of electronic versions of pre-existing print (or even non-SGML electronic) documents, an extremely common enterprise in humanities computing. One reason that such transcriptions are problematic is that portions of pre-existing documents that were created outside SGML editors may, owing to the fallibility of human editors, violate the overall logical structure of those documents. For example, a dictionary entry might improperly omit an obligatory element, or place it out of an otherwise strict and regular position.[3]

To illustrate, consider the following two entries taken from the second edition of the OED. As we have seen, in this dictionary variants normally precede etymological information, but for the word "maverick" the variants follow, rather than precede, the etymology. Compare the entries for "mavourneen," which is regular, and for "maverick," which is anomalous:[4]

mavourneen (m@'vU@ni:n). Also 9 mavournin. [Irish mo mhurnín.] My darling.

maverick ('m{v@rIk), sb. [Samuel A. Maverick (1803-1870), a Texas cattle-owner who left the calves of his herd unbranded.] Also mavorick.

This sort of error leaves the editor of the electronic edition with several unattractive choices, including:

  1. "Correct" the original text during transcription;
  2. Create a loose DTD, which does not enforce the presence or order of elements strictly;
  3. Create a strict DTD, but incorporate an escape hatch structure, which treats deviations as grammatically valid alternatives;
  4. Create an invalid document that violates the proposed DTD.

The first three strategies are compatible with SGML processing: all three yield a DTD and a document instance that can be processed with standard SGML tools. The fourth, on the other hand, yields a document that when submitted to SGML tools generates parser error messages and undefined results at best, and at worst is simply rejected. The interesting issues, then, are of two types: those that distinguish the three SGML-compatible solutions from one another and those that support the SGML-incompatible solution.

It may be worth recalling at this point that the detection of many types of logical errors is the responsibility of processing applications themselves, and not of the SGML tool set per se. For example:

  1. As noted above, SGML software has no way to verify whether a user has entered character data representing pronunciation information correctly inside the <pron> element and character data representing etymological information inside the <etym> element, rather than inadvertently switching them around. SGML is concerned primarily with document structure, and except for the special treatment of certain SGML markup characters, SGML's interest in the specific bytes that constitute each piece of PCDATA content is limited to verifying that all character data contains only legal characters.[5]
  2. Even if character data representing pronunciation information has been entered in the <pron> element and character data representing etymological information in the <etym> element, SGML software has no way to check whether this information has been entered without error. As in the preceding case, SGML may be concerned with where PCDATA may occur within a document, but it is indifferent to the makeup of each instance of PCDATA itself.
  3. Not only does SGML software not examine the particular character data that occurs in a specific PCDATA environment, but it has no way of monitoring whether a PCDATA location contains any data at all. Because a requirement for PCDATA in a particular location may be satisfied by zero data characters, a user who includes an <etym> element but forgets to enter the textual content will not be notified of any error by an SGML application.
  4. Finally, SGML is not concerned with the semantic appropriateness of GIs (generic identifiers, the names of elements). To use the label <pron> systematically to tag etymological information and <etym> systematically to tag pronunciation information would be monumentally confusing to a human, but it would not be an SGML error, because GIs are merely arbitrary signs as far as an SGML system is concerned.
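Checks of this kind therefore fall to the processing application. As a minimal sketch, assuming for convenience that the entry is well-formed enough to be read by Python's XML parser, an application might flag elements that are structurally present but empty. The function name `empty_elements` and the sample entry are illustrative assumptions, not part of any SGML tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical application-level check: report elements that are
# structurally present but carry no character data and no children.
def empty_elements(xml_text):
    root = ET.fromstring(xml_text)
    return [el.tag for el in root.iter()
            if len(el) == 0 and not (el.text or "").strip()]

# Sample entry in which the editor forgot to fill in the etymology.
ENTRY = """<entry>
  <ident><main>mavourneen</main><pron>m@'vU@ni:n</pron></ident>
  <morph><etym></etym></morph>
  <signif><mean>My darling.</mean></signif>
</entry>"""

print(empty_elements(ENTRY))  # ['etym']
```

A structural validator would accept this entry, because an empty `<etym>` still satisfies a PCDATA content model; only a check like the one above notices the gap.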

These SGML-irrelevant errors invite us to ask what sorts of errors an SGML system should monitor. The obvious answer is that SGML is concerned with the syntax of documents, and, specifically, with verifying that the syntax of a document instance obeys the syntactic rules established in an associated DTD. Deviations from the DTD within the document instance constitute SGML errors, and must be reported as such by a validating SGML parser.

Those who work with SGML texts are used to thinking of valid SGML as the inevitable goal of our encoding projects, and we assume that error messages are generated by a parser to alert us to the presence of unwanted faulty data, which we then normally repair before publication or further processing. We are not conditioned to think of syntactically invalid SGML as a natural or desirable state, or as a practical or appropriate way of representing syntactically contradictory source data.

Three Unsatisfactory Approaches

Requirements to be Met

Our evaluation of the various practical solutions to dealing with anomalies in source documents is based on three requirements.

First, we require that where the meaning or function of character data is unambiguous, that data should be transcribed faithfully. Editorial comments about errors in the original source document are not excluded, but, for philological reasons discussed below, these should be restricted to markup, rather than introduced by altering the character data content of the document.

Second, we require that a DTD be an interpretive statement, rather than just an engineering convenience. A DTD represents the human editor's analysis of the structure of a document, as inferred from document analysis.

Third, we recognize that documents produced by humans operating without the benefit of an SGML editing environment may contain errors, and we consider it the philological responsibility of the eventual human editor to distinguish a document's somewhat idealized structure from the reality that may contain fortuitous errors. In other words, the most direct SGML representation of inadvertent deviations from what is otherwise a coherent structure would be an SGML document with syntactic violations. This is in keeping with the second requirement, above, that a DTD is an interpretive statement. An SGML document may thus include the interpretive information that the document instance has a regular structure (formalized in the DTD) and also includes sporadic violations of that structure (encoded within the document instance).

From a more practical perspective, one may consider the DTD inferred from a print dictionary to be a formal statement about the structure of the dictionary. As such, the DTD itself may be regarded as an object of study, and its value as such would be restricted were there no formal distinction between the general structure of the dictionary, on the one hand, and the reality that may contain sporadic deviations from that structure, on the other. To represent a document with errors as syntactically consistent by allowing the DTD to license what philological analysis identifies as erroneous structures would deprive the DTD of much of its utility as a structural model.

Editorial Correction

The first of the four solutions noted above, changing text during transcription to correct errors in an original source document, is unattractive because transcriptions of existing print dictionaries (to continue the original example) have two essences: they are new and functional electronic dictionaries and they are electronic records of existing archaeographic materials, viz. print dictionaries. While correcting errors observes the spirit of the first of these essences, since it produces a more useful and practical electronic reference work, it runs counter to the second, in that it simply suppresses information about the original source document. When the history of the OED is written, scholars may choose to ignore certain errors that occurred in the pre-electronic print editions, but unless those errors are preserved in the source, individual scholars will be unable to make deliberate decisions about how to deal with them. If the editor simply modifies the entry for "maverick" to be:

<entry>
  <ident>
    <main>maverick</main>
    <pron>'m{v@rIk</pron>
    <gram>sb.</gram>
  </ident>
  <variant>Also mavorick.</variant>
  <morph>
    <etym>Samuel A. Maverick (1803-1870), a Texas
      cattle-owner who left the calves of his herd 
      unbranded.</etym>
  </morph>
</entry>

then any trace of the original, anomalous data will have been lost.

The TEI (Text Encoding Initiative) DTD addresses a superficially related problem: the treatment of anomalous character data during textual transcription. For example, errors in character data in original sources may be tagged as <sic>, with an editorial emendation stored in a corr attribute. Alternatively, the editorial emendation may be entered as content in a <corr> element, with the original reading stored in a sic attribute. But despite the superficial similarity between anomalous character data and anomalous structure, SGML does not readily support markup of markup through the use of attributes, and the TEI DTD contains no comparable proposal for wrapping anomalous syntactic structures in an umbrella element that would facilitate the specification of two types of markup as markup, one anomalous (as found in the original source) and the other logically correct but philologically unfaithful.[6]

One complication that arises when working with print documents is that the print medium necessarily imposes a linear order, even when this order may appear to be more properly a matter of presentation than of structure. Much as footnotes and endnotes may be regarded as two presentational variants of a single structural note-type element, the order of the parts of a print dictionary entry may be regarded as purely a presentational artifact imposed by the print medium on information that alternatively may be encoded fully without reference to linear order by SGML tags. This issue emerges from an imbalance between the SGML model of data as a single ordered hierarchy of content objects (see the critique in Renear et al. 1993) and the inherently twofold existence of print documents as both archaeographic objects (including all of the presentational features imposed by print) and structural abstractions.

The Loose DTD

The second alternative is to use a loose DTD that does not enforce a strict element order. For example, the ordered set of elements can be made unordered by changing the commas in the content model to ampersands, and accidental omissions in the source can be catered to by making all elements in the content model optional, as in this humorously noncommittal declaration--a Hamlet among DTDs:

<!ELEMENT entry  - - (ident? & variants? & morph? & signif*) >
<!ELEMENT ident  - - (main? & pron? & gram? & spec? & status? 
  & early? & inflect?) >
<!ELEMENT morph  - - (etym? & hist? & misc?) >
<!ELEMENT signif - - (mean? & cites*) >
<!ELEMENT cites  - - (cite*) >
<!ELEMENT cite   - - (date? & source? & text?) >
<!ELEMENT source - - (who? & where?) >
<!ELEMENT where  - - (title? & page?) >

This approach is unattractive because it sacrifices structural information. If the evidence of document analysis clearly points to the transgression in a particular place of a structure that otherwise observes a strict order, rather than to conformity to a loose structure that does not require strict order, a DTD based on the latter conceals information about the document, viz. what is regular and what is exceptional. If the DTD is viewed as a formal model of the editor's interpretation of the structure of a document, it is properly a potential object of study in its own right, and suppressing the distinction between the regular and the exceptional distorts the model.
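The loss of information can be made concrete with a small Python sketch. It is a deliberately simplified illustration: the element names follow the tagged examples (`variant` rather than `variants`), optionality and repetition are ignored, and `strict_ok` and `loose_ok` are hypothetical stand-ins for validation against the strict and loose content models.

```python
# Child order assumed by the strict model (simplified for illustration).
STRICT_ORDER = ["ident", "variant", "morph", "signif"]

def strict_ok(children):
    # Children must appear in the declared order (optionality is
    # ignored here to keep the sketch short).
    positions = [STRICT_ORDER.index(c) for c in children]
    return positions == sorted(positions)

def loose_ok(children):
    # "&" with every member optional: any order, each name at most once.
    return (set(children) <= set(STRICT_ORDER)
            and len(set(children)) == len(children))

mavourneen = ["ident", "variant", "morph", "signif"]
maverick = ["ident", "morph", "variant"]
for name, kids in [("mavourneen", mavourneen), ("maverick", maverick)]:
    print(name, strict_ok(kids), loose_ok(kids))
# mavourneen True True
# maverick False True
```

Under the strict model the two entries are distinguished; under the loose model both simply pass, and nothing in the validation result records that "maverick" is exceptional.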

One might wonder, when confronted with a very small number of articles in the dictionary that share some structural peculiarity, whether they are a) a very rare but perfectly valid variation on the standard pattern, b) errors, or c) the sole remaining evidence that the compilers of the original dictionary had a mental model of the dictionary entry that permitted far more variation than they (almost) ever actually used.

This question is a particular application of a general problem that arises whenever we transcribe existing print sources. One advantage commonly cited for descriptive over procedural markup is that procedural markup may neutralize and conflate structurally different elements. For example, a printed document may use italics to represent emphasis, book titles, and foreign words, all of which may occur in some of the same environments. SGML does not prohibit the use of procedural markup tags such as <italic>, but most approaches to encoding such texts for humanities research would assign descriptive tags such as <emph>, <title>, and <foreign> to these pieces of text, and it would be the responsibility of the editor to infer the meaning of italic type wherever it occurs. In most cases the function of italic type in a specific context in a printed source document is obvious, but apparent ambiguities are possible, and it may not be easy to describe the algorithm a human editor would use to resolve them.

In the problem cited above, assuming a large volume of consistent material, we would rule out (a) (rare but valid variation) unless we could identify a reason for the dictionary compiler to have deviated deliberately from a pattern followed consistently elsewhere. And we would usually rule out (c) (broader mental model) as possibly true but uninteresting, in that it is always possible that the compiler had a mental model that was broader even than anything actually printed in the dictionary. Interpreting typographic features structurally suffers from the same limitations as any historical reconstruction, and historical records are often faulty. A decision in the present case may come down to a subjective editorial conclusion about whether it is more probable that the compiler had a loose mental model than that he committed a small number of errors.[7]

The Escape Hatch

The third alternative is to combine a tightly-structured DTD with a loosely-structured escape hatch for deviant data. This is the solution adopted by the TEI DTD, which for transcriptions of print dictionaries distinguishes <entry>, a well-structured lexical entry, from <entryfree>, which may contain any dictionary elements in any combination, and therefore caters to lexical entries that violate normal structural constraints. Applying this solution to our example, we might have:

<!ELEMENT entryfree  - - (identfree? & variants? & morphfree? 
  & signiffree*) >
<!ELEMENT identfree  - - (main? & pron? & gram? & spec? & status? 
  & early? & inflect?) >
<!ELEMENT morphfree  - - (etym? & hist? & misc?) >
<!ELEMENT signiffree - - (mean? & citesfree*) >
<!ELEMENT citesfree  - - (citefree*) >
<!ELEMENT citefree   - - (date? & sourcefree? & text?) >
<!ELEMENT sourcefree - - (who? & wherefree?) >
<!ELEMENT wherefree  - - (title? & page?) >

The entry for "maverick" would then be:

<entryfree>
  <identfree>
    <main>maverick</main>
    <pron>'m{v@rIk</pron>
    <gram>sb.</gram>
  </identfree>
  <morphfree>
    <etym>Samuel A. Maverick (1803-1870), a Texas
      cattle-owner who left the calves of his herd 
      unbranded.</etym>
  </morphfree>
  <variant>Also mavorick.</variant>
</entryfree>

Although from an engineering standpoint this technique allows the editor to mark up the text in a way that can be processed by standard SGML tools, from an intellectual perspective it is unsatisfactory because it violates our second requirement, namely that the DTD be an interpretive statement. It seems unlikely that the editors of the OED were indulging in a little esoteric philological humor by recording the word in a totally free-form way so that the entry would reflect the meaning--by having "maverick" be a maverick, so to speak. In all probability they intended to use the strict entry, but goofed up. Using a loose entry, even as an escape hatch, obscures that fact.

Put another way, with this approach the fact that the content of such elements is erroneous, rather than correctly and appropriately unusual or unconstrained, becomes purely a semantic matter, even though by nature the error is syntactic. That is, markup in general is supposed to represent the syntax of the document, but the syntactic error in this case would be represented not by syntactically erroneous markup, but by the extrasyntactic meaning assigned to specific elements. This type of model involves less loss of information than in the first two solutions, since irregularities are identified as irregularities, albeit through the names of GIs, rather than through the distinction between syntactic validity and invalidity in the document instance. But since the principal thing SGML understands is the distinction between syntactic validity and invalidity, there is something philosophically unsatisfactory about this dislocation, and from an information-processing point of view there still is a loss of information, since the use of the <entryfree> element tells us nothing about the particular deviation of each anomalous element from its ideal counterpart.

A Transformational Solution

In Praise of Invalid Markup

The ironic truth is that nothing captures the reality of a mistake in legacy text as cleanly as invalid markup. A document instance that violates its DTD preserves the integrity of the original source, encodes explicitly the difference between norm and violation of norm, and represents syntactic anomaly in the source as syntactic anomaly in the document instance. This approach seems almost kinky or subversive in an SGML context, because our conditioned perception of SGML as a way of modeling document structure assumes that documents should be structured, that there should be no errors in a document's structure, and that a document's structure should be represented by its DTD.

From the perspective we advocate here, a document must be considered to have at least two structures: an ideal one (which must be inferred by scholars) and a concrete one that conforms to the ideal in most--but not necessarily all--places. The SGML document model described here is intended to support the use of the DTD to formalize a grammar of the ideal structure, while permitting the document instance to include violations of that structure.[8]

This model may be unusual in SGML terms, but it is standard operating procedure in other disciplines. For example, descriptive linguistics traditionally distinguishes competence from performance or langue from parole, where the first term represents the ideal abstract grammatical competence of a speaker and the second term represents his real speech, complete with unintended slips of the tongue. Within this model, linguistic performance is easy to observe and describe, while abstract linguistic competence can only be inferred from the study of performance. Linguists nonetheless recognize that the system of linguistic competence is an important object of study from several perspectives, including formal modeling. To return to an SGML environment, we suggest here that a DTD can be considered a model of the grammatical competence of a document, while the document instance represents the associated--and possibly imperfect--grammatical performance.

The traditional assumption that documents should not contain structural errors is challenged by the examples cited above from the OED, which demonstrate that documents that exhibit what in many respects appears to be a rigid structure (such as dictionaries) may occasionally deviate from an otherwise very consistent structure through human error (or, to describe the situation from a less SGML-oriented perspective, as a result of the editors not being as preoccupied with formal structure as SGML authoring tools might be). These violations are informational, at least from an archaeographic perspective, and should be preserved. And what should be preserved is not merely that the offending portions observe a different but unremarkable structure, and certainly not that the overall document structure is generally loose. What should be preserved is what document analysis reveals: the document has a highly-structured implicit DTD and the offending portions are conceptual violations of this DTD, rather than alternative valid structures. Although SGML distinguishes valid from invalid syntactic structures, and although documents may contain a smattering of philologically true (authentic, imported from a source and requiring preservation) syntactically invalid structures amid a sea of valid ones, the "escape hatch" solution preserves the distinction only by translating it from the syntactic to the semantic. This translation, in turn, restricts the DTD to functioning as a grammar of performance, rather than of competence.

The requirement that SGML be syntactically valid seems in most contexts so obvious that it would rarely be questioned, but if document analysis of existing documents reveals violations of the structure identified during the document analysis process, the most appropriate model of this information in SGML terms involves invalid SGML.

Parser Error Messages as Specifications of Intertextual Relationships

Having concluded that invalid markup is philosophically the most satisfactory solution to the anomalous data problem, one might be tempted to let it go at that, maintaining an invalid document instance that generates error messages when submitted to SGML tools. To continue with our "maverick" example, one would simply grin and bear it when submitting the document to the SGML system:

<entry>
  <ident>
    <main>maverick</main>
    <pron>'m{v@rIk</pron>
    <gram>sb.</gram>
  </ident>
  <morph>
    <etym>Samuel A. Maverick (1803-1870), a Texas
      cattle-owner who left the calves of his herd 
      unbranded.</etym>
  </morph>
  <variant>Also mavorick.</variant>
***** Line 12: Element "variant" not allowed here.
</entry>

Unfortunately, this approach is just as unacceptable from an engineering viewpoint as the first three were from a philosophical one. Some of the problems it poses are:

  1. Parser error messages are not well standardized, which complicates the automated processing of such messages across systems and from one release of a tool to the next.
  2. The anomalous structure in the legacy text is not easily distinguished from structural errors, which the user will not wish to preserve or process, and which should simply be corrected when reported during document preparation. When presented with output such as:
    <entry>
      <ident>
        <main>maverick</main>
        <gram>sb.</gram>
    ***** Line 4: Element "gram" not allowed here
        <pron>'m{v@rIk</pron>
        <gram>sb.</gram>
      </ident>
      <morph>
        <etym>Samuel A. Maverick (1803-1870), a Texas
          cattle-owner who left the calves of his herd 
          unbranded.</etym>
      </morph>
      <variant>Also mavorick.</variant>
    ***** Line 13: Element "variant" not allowed here.
    </entry>

    it is a tedious, error-prone process to figure out that the error on line 4 should be corrected but the one on line 13 should not be.

  3. Parser error messages may not identify the exact syntactic nature of the error in a useful way. For example, a parser may be unable to distinguish a transposition from a combination of omission and insertion.[9]
  4. One of the principal advantages of using an automated tool to validate a document's structure is to simplify processing applications, which can assume that the documents given to them conform to their DTDs. Maintaining an invalid document means that at best one will get unpredictable results from downstream tools; at worst, the document will simply be rejected.
  5. Finally, from an intellectual perspective, error messages are simply poor specifications of the relationship between the ideal and concrete document structures. The error message
    ***** Line 13: Element "variant" not allowed here.

    is, in some very loose sense, a specification of the relationship between this dictionary entry and the DTD, but it is certainly not a very good one. It fails to capture the underlying concept that "the variant and the morphology have been interchanged", and certainly cannot be used to produce the correct version of the entry.

To be sure, there may be partial work-arounds to some of these problems within the context of traditional tools. SGML parsers may be able to recognize some types of errors well enough to generate informative error messages, recognizing, in effect, both a correct grammar that raises no errors and a looser grammar that includes a number of incorrect constructions that the parser can nonetheless identify. This raises the possibility of using the notion of multiple grammars to enrich parser output, so that a document might, for example, be evaluated not only as correct or incorrect, but as correct or incorrect in a variety of specific ways. Elements that have been omitted or included erroneously are already recognized by such parsers as James Clark's nsgmls; recognizing elements that are misplaced as dislocations, rather than unexpected omissions combined with unexpected insertions, may prove more difficult. Other techniques might include using architectural forms to map from a strict to a loose DTD, or using DSSSL or XSLT to do tree transformations. All of these approaches, however, suffer from the inherent clumsiness of attempting to deal with invalid data in an environment designed to enforce validity.
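To illustrate the difference between reporting a dislocation and reporting an omission combined with an insertion, here is a minimal Python sketch of a richer diagnosis. It is not modeled on any existing parser: the function `diagnose` is hypothetical, and it assumes that `expected` already lists, in ideal order, exactly the elements that the entry ought to contain.

```python
from itertools import combinations

def diagnose(expected, actual):
    """Classify how `actual` deviates from the `expected` child order."""
    if actual == expected:
        return ("valid",)
    if sorted(actual) == sorted(expected):
        # Same elements, different order: look for a single swap.
        for i, j in combinations(range(len(actual)), 2):
            trial = list(actual)
            trial[i], trial[j] = trial[j], trial[i]
            if trial == expected:
                return ("transposition", actual[i], actual[j])
        return ("reordering",)
    # Different sets of elements: report what is missing and what is extra.
    missing = [e for e in expected if e not in actual]
    extra = [a for a in actual if a not in expected]
    return ("omission/insertion", missing, extra)

print(diagnose(["ident", "variant", "morph"], ["ident", "morph", "variant"]))
# ('transposition', 'morph', 'variant')
```

Even this toy classifier says more than "element not allowed here", but it works only because the ideal order is supplied from outside, which is exactly the knowledge a validating parser does not expose in its messages.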

A Transformational Approach

Instead of simply maintaining an invalid document instance and attempting to work around the problems it causes, there is another solution which we feel allows one to eat one's markup cake and have it too. In this approach, the relationship between what we have called the "ideal" document and its concrete instantiation is specified as a set of rule-based text-to-text transformations. Rule-based rewriting systems are a well-understood topic in computer science, and provide a powerful paradigm for expressing the relationships between texts. The system we have in mind looks like this:

[Figure: fig1.gif]

The basic idea is to maintain the text in a valid SGML document from which the historical, invalid document can be automatically derived at any time for research purposes.[10] To this end, we allow the editor to record the discrepancy between the ideal element and the actual one with an attribute specifying how to transform the one into the other. For our "maverick" example, we might have a variant-morph transformation which captures the fact that the variant and the morphology have been interchanged:

<entry transform="variant-morph">
 <ident>
  <main>maverick</main>
  <pron>'m{v@rIk</pron>
  <gram>sb.</gram>
 </ident>
 <variant>Also mavorick.</variant>
 <morph>
  <etym>Samuel A. Maverick (1803-1870), a Texas
    cattle-owner who left the calves of his herd
    unbranded.</etym>
 </morph>
</entry>

To implement the transformational component, Gema (Generalized Macro Language) is an obvious choice, since it was designed specifically for rule-based text-to-text transformations and allows the entire transformation engine to be written in just one line. Other pattern-matching languages such as Perl or Awk could be used as well.

To express the way the anomalous entry for "maverick" is related to its platonically correct one, we might write the variant-morph transformation rule in Gema as follows:

variant-morph:[variant][morph]=$2$1

This states in an intuitive way that in entries of this type, the morphology and the variant have been swapped. The recognizers for morph and variant are also quite easy to specify in Gema:

morph:<morph>[U]</morph>=$0@end
morph:=@fail

variant:<variant>[U]</variant>=$0@end
variant:=@fail

These expressions say quite simply that anything between morph tags is a morph, and anything between variant tags is a variant. It should be stressed that these Gema specifications are executable, so that applying the variant-morph transformation to the canonical entry for "maverick" will at any time generate the anomalous version:

<entry transform="variant-morph">
 <ident>
  <main>maverick</main>
  <pron>'m{v@rIk</pron>
  <gram>sb.</gram>
 </ident>
 <morph>
  <etym>Samuel A. Maverick (1803-1870), a Texas
    cattle-owner who left the calves of his herd
    unbranded.</etym>
 </morph>
 <variant>Also mavorick.</variant>
</entry>
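For readers without Gema at hand, the same rule can be approximated in another pattern-matching language. The following is a minimal sketch in Python, not a translation of the Gema semantics: the names VARIANT, MORPH, and variant_morph are invented for illustration, and the sketch assumes that the two elements carry no attributes, are not nested inside elements of the same name, and are separated only by whitespace.

```python
import re

# Recognizers: anything between morph tags is a morph, and anything between
# variant tags is a variant (non-greedy, and allowed to span line breaks).
VARIANT = r"<variant>.*?</variant>"
MORPH = r"<morph>.*?</morph>"

# A variant followed by a morph, with only whitespace between them.
VARIANT_THEN_MORPH = re.compile(
    rf"(?P<variant>{VARIANT})(?P<gap>\s*)(?P<morph>{MORPH})", re.DOTALL
)


def variant_morph(entry: str) -> str:
    """Emit the morph before the variant, keeping the whitespace between them."""
    return VARIANT_THEN_MORPH.sub(
        lambda m: m.group("morph") + m.group("gap") + m.group("variant"),
        entry,
        count=1,  # rewrite only the first variant/morph pair in the entry
    )
```

Calling variant_morph on the canonical entry should yield the anomalous one with its indentation intact. An entry in which no variant is immediately followed by a morph is returned unchanged.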

The final piece of the picture is the one-line transformation engine that uses the transform attribute to call the appropriate transformation:

<entry transform\="[U]">[U]</entry>=@define{temp:[U]=\@$1\{\$0\}}@temp{$0}

We may break this rule into three pieces. The left-hand side matches entries with transform attributes:

<entry transform\="[U]">[U]</entry>=

For simplicity's sake we are assuming that the transform attribute will be the only attribute on entries. The define statement redefines the temp statement to be a call on the transformation specified by the transform attribute (the first pattern matched, referenced as "$1"):

@define{temp:[U]=\@$1\{\$0\}}

Finally, the entry as a whole is transformed using the redefined temp transform:

@temp{$0}

This is a very simple example, but should suffice to demonstrate the utility of the technique. Any textual elements that can be extracted by means of Gema's powerful context-sensitive parsing rules can be reordered, deleted, duplicated, and augmented by all the techniques available in programming languages. In our experience, the implementation cost of such transformations is a linear function of their conceptual complexity.
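The dispatching step can likewise be sketched outside Gema. The Python below is an illustration under stated assumptions, not the authors' engine: TRANSFORMS, swap_variant_morph, and apply_transforms are hypothetical names, the transform attribute is assumed to be the only attribute on an entry (as in the text), and entries are assumed not to nest. A dictionary lookup stands in for the redefinition of temp.

```python
import re
from typing import Callable, Dict


def swap_variant_morph(entry: str) -> str:
    """Hypothetical rule: emit the morph before the variant."""
    return re.sub(
        r"(<variant>.*?</variant>)(\s*)(<morph>.*?</morph>)",
        r"\3\2\1",
        entry,
        count=1,
        flags=re.DOTALL,
    )


# Registry mapping transform attribute values to rule functions (assumed names).
TRANSFORMS: Dict[str, Callable[[str], str]] = {
    "variant-morph": swap_variant_morph,
}

# An entry whose only attribute is transform; the attribute value is captured.
ENTRY = re.compile(r'<entry transform="(?P<name>[^"]*)">.*?</entry>', re.DOTALL)


def apply_transforms(document: str) -> str:
    """Rewrite every entry that names a transform; leave other text untouched."""

    def dispatch(match: "re.Match[str]") -> str:
        name = match.group("name")
        if name not in TRANSFORMS:
            # Fail loudly rather than silently passing the entry through.
            raise ValueError(f"unknown transform: {name!r}")
        return TRANSFORMS[name](match.group(0))

    return ENTRY.sub(dispatch, document)
```

Adding a new kind of anomaly then means writing one rule function and registering it under the attribute value that editors will use. Raising an error for an unregistered name is a design choice of this sketch; passing such entries through unchanged would be an equally defensible alternative.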

With this approach, we believe we have found the best of both worlds. The publisher can submit the canonical SGML version of the text to standard SGML tools and have it processed without error messages or other deleterious side effects, while the archeographer can at any time, at near-zero cost, produce a version of the text that represents the anomalous, historical print record.

Conclusion

The specification of anomalous data by means of text-to-text transformation rules seems to solve the problem of anomalous data. Intellectually it captures the essence of the problem and meets all of the philosophical desiderata we listed above. In addition, it has many advantages from the engineering viewpoint. It provides a complete, formal, machine-processable specification of the intertextual relationship. It is user-friendly: intuitive, maintainable, and intellectually satisfying. It can generate both the anomalous and the ideal version of the document on demand. Finally, it is easily portable across computer systems, and not tied to any particular SGML tool set.

Ultimately one can envisage merging this transformation-based system with a traditional SGML system, perhaps using the error-correction facilities of the parser to generate at least some of the transformation rules automatically. In an authoring environment enriched in this way, the system might query the user upon encountering a parsing error. The user would either correct the error or inform the system of how the erroneous structure should be mapped automatically to a valid structure. Interjecting this type of associative layer into the model has three benefits. It allows the document instance to preserve the syntax of the original. It allows the DTD to model the abstract structure underlying the original even when that structure is not followed with absolute fidelity. And it provides a format in which users can specify formally the relationships between ideal and actual markup without compromising the integrity of either the transcription of the primary source or the DTD that purports to represent the syntactic structure underlying that source.


Notes

[1] See Berg 1989 concerning the conversion of the printed OED to electronic form and Young-Lai 1996 concerning one strategy for inferring a DTD from a non-SGML document with in-line tags. Painter 1998 is clearly relevant, but I have not been able to obtain access to a copy.

[2] Note that in a real document, the pronunciation would more probably be represented with SDATA character entities (for example the Unicode IPA (International Phonetic Alphabet) block), rather than the SAMPA (Speech Assessment Methods Phonetic Alphabet) transcription of IPA characters using ASCII (American Standard Code for Information Interchange).

[3] Quin 1996 distinguishes prescriptive from descriptive DTDs, with the latter suitable for creating "an electronic version of material that already exists in a non-SGML format" (415) in a way that records ambiguity within a valid SGML document. One might reasonably approach documents that were originally constructed outside an SGML environment as taking priority over the DTD that might be inferred after the fact, so that what is anomalous in such examples is not so much the data (text), but the structure. From this perspective, anomalous structures may be correct but not captured by the inferred DTD, incorrect with respect to the inferred DTD, or perhaps simply not capturable within any usable DTD.

[4] The only statement that may be made about the validity of a document within an SGML model is that the document is either valid or invalid. But as Asimov 1986 notes, there may be different degrees of wrong, and the more nuanced XML (Extensible Markup Language) model distinguishes documents that are valid (in the SGML sense) from those that are well-formed (the elements, delimited by start- and end-tags, nest properly within each other and there is a single root element, but their structure is not necessarily governed by a DTD). All valid XML documents are also well-formed, but the reverse is not true. The present defense of invalid documents assumes that the documents in question are well-formed.

It should also be noted that our description of the structure of the entry for "maverick" as anomalous is an interpretation, and one might alternatively argue that the order of the components of this entry represents a variant, rather than an anomaly. This issue is discussed in greater detail in the discussion of loose DTDs, below.

[5] SGML does support data typing through notations, although a) the actual type validation must be performed by an external process, rather than by the SGML parser itself, and b) not all constraints on data content can be expressed in ways that lend themselves easily to automated validation, especially in a system-independent manner.

[6] See the discussion of escape hatches, below.

[7] Distinguishing the rare from the exceptional is, of course, a general classificatory problem not restricted to questions of textual markup.

[8] It might be instructive in this context to consider one of the principal differences between the DTD as part of the original SGML standard and schemas: SGML and XML allow only a single DTD, while a schema-based system would be able to associate a document instance with multiple schemata. A schema-based approach would thus enable a document to be associated with both of the structures described above. See Prescod 1998.

[9] For example, the structure <A><C><B><D> where <A><B><C><D> is expected may be interpreted in at least three ways:

  1. omission of <B> between <A> and <C> and insertion of <B> between <C> and <D>;
  2. omission of <C> between <B> and <D> and insertion of <C> between <A> and <B>; and
  3. transposition of <B> and <C> between <A> and <D>.

While all three produce the same result in this example, they have somewhat different implications on a more general scale. For example, interpretations 1 and 2 do not preclude omission without insertion or insertion without omission, while interpretation 3 views these two processes as interdependent.

[10] To be perfectly honest, it makes no difference whether the "master" document is the valid one or the invalid one.


Acknowledgements

The first version of this paper was presented by David J. Birnbaum under the title "In Defense of Invalid SGML" at the 1997 annual joint meeting of the ACH (Association for Computers and the Humanities)/ALLC (Association for Literary and Linguistic Computing) at Queen's University, Kingston, Ontario. A second version was presented by the same author under the title of the present article at the Markup Technologies '98 conference in Chicago, IL and, with slight revision, at a meeting of the Pittsburgh (PA) XML/SGML Lecture Series (PghXSLS). The authors are grateful to John Lavagnino, C. Michael Sperberg-McQueen, Chris Maden, Paul Prescod, Michael Spring, and Frank Tompa for comments on those earlier texts, and to an anonymous referee for comments on an earlier version of this article.


Bibliography