The Problem of Anomalous Data

David J. Birnbaum

David J. Birnbaum is Associate Professor and Chair of the Department of Slavic Languages and Literatures at the University of Pittsburgh. His research in electronic text technology is concentrated primarily on problems of encoding and processing medieval Slavic manuscript materials.

Keywords: SGML, DTD, anomalous data, legacy documents, validation

Abstract: SGML was developed primarily for encoding new texts, and for ensuring that the structure of these texts conforms to a grammar specified in a DTD. The use of SGML to produce new electronic editions of existing print documents, a common operation in humanities computing, raises a problem that is not present when one creates new electronic documents that have no ancestors: existing paper texts may violate their underlying structures because of human error during their compilation or production, while their SGML counterparts are expected to conform to the structure specified in a DTD. There are various workable engineering solutions to this problem, but the availability of these strategies should not obscure the philosophical problem of encoding what is in essence an invalid document within a framework designed specifically to support validity.

Background

SGML (Standard Generalized Markup Language, ISO 8879:1986) was developed primarily for encoding new texts, an environment in which the rigorous adherence to a DTD (Document Type Definition) ensures that the resulting documents will observe a coherent structure. Consider, for example, the Oxford English Dictionary, which is divided into, among other things, lexical entries, each of which consists of the following information (described in the introduction to the dictionary itself):

1. identification
   a. main form (required)
   b. pronunciation (required; within parentheses)
   c. grammatical designation (optional; omission means substantive)
   d. specification (optional; e.g., music, biology)
   e. status (optional; e.g., obsolete, archaic, colloquial, dialect)
   f. earlier forms (optional)
   g. inflections (optional)
2. morphology
   a. etymology (required)
   b. subsequent history (optional)
   c. miscellaneous comments (optional)
3. signification (meaning, subdivided into multiple hierarchies that trace the development of different meanings, with illustrative quotations for each)

The following is a simplified partial illustration:

"Chaser2 (phonetic transcription in parentheses omitted here due to typographic limitations) [f. CHASE v.2 + -ER] 1 One who chases or engraves metal. 1707 EARL BINDON in Lond. Gaz. No. 4339/3 Engravers, Carvers, Chacers. 1762-71 H. WALPOLE Vertue's Anecd. Paint. (1786) I. 153 Enamellers and chasers of plate."

One way to represent the structure of dictionary entries of this type (partially, and with considerable simplification) would be the following (assume that the content of all undefined elements is #PCDATA):

(See Berg concerning the conversion of the printed Oxford English Dictionary to electronic form and Young-Lai concerning one strategy for inferring a DTD from a non-SGML document with in-line tags.)

The text of the dictionary excerpt cited above might be marked up in conformity with this DTD fragment (with some simplification) as follows:
Chaser
(phonetic transcription; omitted here due to typographic limitations)
f. CHASE v.2 + -ER One who chases or engraves metal. 1707. Earl Bindon in Lond. Gaz. No. 4339/3. Engravers, Carvers, Chacers. 1762–71. H. Walpole Vertue's Anecd. Paint. (1786) I. 153 Enamellers and chasers of plate.
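The ordering constraints that a content model of this kind imposes can be sketched in a few lines of Python. This is an illustration only, under stated assumptions: the element names (form, pron, gram, and so on) are hypothetical stand-ins for the identification items listed above, not the names of any actual DTD, and the function is a toy checker for a flat sequence of child elements, not an SGML parser.

```python
# Hypothetical element names for the "identification" part of an entry.
# Each pair is (element name, required?), listed in the required order.
IDENTIFICATION_MODEL = [
    ("form", True),      # main form
    ("pron", True),      # pronunciation
    ("gram", False),     # grammatical designation
    ("spec", False),     # specification
    ("status", False),   # status
    ("earlier", False),  # earlier forms
    ("infl", False),     # inflections
]


def conforms(children, model=IDENTIFICATION_MODEL):
    """Return True if the list of child element names follows the model:
    items in the declared order, required items present, optional items
    present at most once."""
    pos = 0
    for name, required in model:
        if pos < len(children) and children[pos] == name:
            pos += 1          # the expected element is present
        elif required:
            return False      # a required element is missing or displaced
    return pos == len(children)  # no child may be left unaccounted for


print(conforms(["form", "pron", "status"]))  # status in its legal position
print(conforms(["form", "status", "pron"]))  # status before pronunciation
```

The first call prints True and the second prints False: an optional element may be omitted, but when it is present it may occur only in its declared position.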
Users who create new dictionaries based on this DTD will be required by their SGML editing and validating tools to follow the specified structure. The model provides some flexibility, so that, for example, authors may include or omit an indication of the lexical status of a particular entry. But this flexibility is restricted, so that, for example, if a status element is included, it must follow, rather than precede, the obligatory elements that come before it in the model. SGML software is not, of course, able to ensure that the author actually enters status information into the element, but the software can, at least, verify that the element itself occurs only in a legal environment. An SGML environment thus protects users from inadvertently creating syntactically contradictory documents.

A Historical Interlude

Samuel Johnson, author of the 1755 Dictionary of the English Language, the first dictionary of any significance that attempted to document not just rare or unusual English words, but the English language in general, waggishly defined a lexicographer as "a harmless drudge." Those who have read Simon Winchester's recently published study entitled The Professor and the Madman know better; as we shall see below, lexicographers may sometimes be harmful drudges, as well.

The great organizational philosophy underlying the Oxford English Dictionary was that the dictionary would be based on historical principles, which meant, among other things, that the history of words and their meanings would be determined by and documented with citations from the entire history of writing in the English language. This quest seems almost quixotic for a time before computers, and the raw citations from which the Oxford English Dictionary was constructed were assembled manually by an army of volunteer readers, who combed selected works for quotations that would illustrate the use or meaning of every word in the language.
As Simon Winchester tells us, one of the most prolific of these readers was Dr. William Chester Minor, a retired U.S. Army surgeon who had served in the Union army during the American Civil War, and who had then moved to Lambeth, at that time outside London, after his retirement from the army. (The following summary of Minor's history is paraphrased from Winchester's recent book.) Dr. Minor contributed tens of thousands of citations to the Oxford English Dictionary project, more than almost any other reader, but the most unusual aspect of his participation in the dictionary project was that his contributions were prepared in and submitted from his cell in the Broadmoor Criminal Lunatic Asylum in rural Berkshire.

Dr. Minor's forced retirement from the U.S. Army had been for reasons of mental illness, and his insanity had included, among other things, a paranoid fear of the Irish. Late one night in 1871, shortly after arriving in England, Dr. Minor had rushed out onto the street in Lambeth, where he shot and killed an innocent English laborer under the delusion that the man was part of an Irish conspiracy to break into his rooms at night and molest him. In keeping with British justice, Dr. Minor was found not guilty by reason of insanity and sentenced to be "detained in safe custody until Her Majesty's Pleasure be known." In fact, his incarceration was to last most of his life, and he was transferred to a Stateside hospital, where he could be visited by his relatives, only in extreme old age.

I have related a bit of Dr. Minor's history to demonstrate that the Oxford English Dictionary has long attracted the attention of, shall we say, eccentrics. In my own rather more harmless case, I shall argue that the dictionary illustrates a type of encoding problem that may not have been anticipated when SGML was developed, and for which invalid SGML of the sort that generates parser error messages may be the most appropriate way to represent the information in question.
The Problem

In a standard SGML authoring model, SGML tools can ensure that newly created documents conform to a DTD developed by the author. This model is appropriate in an environment where SGML tools are used to create structured documents, but it is somewhat less well suited to the production of electronic versions of preexisting print (or even non-SGML electronic) documents, an extremely common enterprise in humanities computing. Such transcriptions are problematic because preexisting documents that were created outside SGML editors may, owing to the fallibility of human editors, violate the overall logical structure of those documents. For example, an isolated dictionary entry might improperly omit an obligatory element, or place it outside its otherwise strict and regular position.

This type of error actually occurs in the second edition of the Oxford English Dictionary. Variants traditionally precede etymology, but for maverick the etymology precedes the variants. Compare the entries for mavourneen (regular) and maverick (anomalous):

"maverick (pronunciation omitted for typographic reasons), sb. [Samuel A. Maverick (1803–1870), a Texas cattle-owner who left the calves of his herd unbranded.] Also mavorick."

"mavourneen (pronunciation omitted for typographic reasons). Also 9 mavournin. [Irish mo mhurnín.] My darling."

This sort of error leaves the editor of the electronic edition with several unattractive choices, including:

1. "Correct" the original text during transcription;
2. Create a loose DTD, which does not enforce the presence or order of elements strictly;
3. Create a strict DTD, but incorporate an escape hatch structure, which treats deviations as grammatically valid alternatives;
4. Create an invalid document that violates its DTD.

The first three strategies are compatible with SGML processing: all three yield a DTD and document instance that can be processed with standard SGML tools.
The fourth, on the other hand, yields a result that generates parser error messages and results that at best are undefined and at worst are crippling. The interesting issues, then, are of two types: those that distinguish the three SGML-compatible solutions from one another and those that support the SGML-incompatible solution.

It may be worth recalling at this point that it is not the responsibility of an SGML document-processing environment to identify all types of logical errors. For example:

1. As noted above, SGML software has no way to verify whether a user has entered pronunciation information correctly inside the pronunciation element and etymological information inside the etymology element, rather than switching them around. SGML is concerned primarily with document structure, and (except for the special treatment of certain SGML markup characters) SGML's interest in the specific bytes that comprise each piece of PCDATA content is limited to verifying that all character data contains only legal characters.
2. Even if pronunciation information has been entered in the pronunciation element and etymological information in the etymology element, SGML software has no way to check whether this information has been entered without error. As in the preceding case, SGML may be concerned with where PCDATA may occur within a document, but it is indifferent to the makeup of each instance of PCDATA itself.
3. Not only does SGML software not examine the particular character data that occurs in a specific PCDATA environment, but it has no way of monitoring whether a PCDATA location contains any data at all. Because a requirement for PCDATA in a particular location may be satisfied by zero data characters, a user who includes an element but forgets to enter the textual content will not be notified of any error by an SGML application.
4. Finally, SGML is not concerned with the semantic appropriateness of generic identifiers (the names of elements). To use the label intended for pronunciation systematically to tag etymological information, and the label intended for etymology systematically to tag pronunciation information, would be monumentally confusing to a human, but it would not be an SGML error, because generic identifiers are merely arbitrary signs as far as an SGML system is concerned.

These SGML-irrelevant errors invite us to ask what sorts of errors an SGML system should monitor, and the obvious answer is that SGML is concerned with the syntax of documents, and, specifically, with verifying that the syntax of a document instance obeys the syntactic rules established in an associated DTD. Deviations from the DTD within the document instance constitute SGML errors, and must be reported as such by a validating SGML parser.

Those who work with SGML texts are used to thinking of valid SGML as the inevitable goal of our encoding projects, and we assume that error messages are generated by a parser to alert us to the presence of faulty data, which we then normally repair before publication or further processing. We are not conditioned to think of syntactically invalid SGML as a natural or desirable state, or as a practical or appropriate way of representing syntactically contradictory source data.

Practical Solutions

Assumptions

My evaluation of the various practical solutions to dealing with anomalies in source documents is based on two assumptions. First, I assume that where the meaning or function of character data is unambiguous, that data should be transcribed faithfully. Editorial comments about errors in the original source document are not excluded, but for reasons discussed below, these should be restricted to markup, rather than introduced by altering the character data content of the document. Second, I assume that a DTD is an interpretive statement, rather than just an engineering convenience. A DTD represents the editor's analysis of the structure of a document, as inferred from document analysis.
If we find errors in our source document, a faithful representation of the source document would show a structure with violations, because to represent the document as structurally consistent would contradict the conclusions implied by document analysis.

Editorial Correction

The first of the solutions noted above, changing text during transcription to correct errors in an original source document, is unattractive because transcriptions of (to continue the original example) existing print dictionaries have two essences: they are new and functional electronic dictionaries and they are electronic records of existing archæographic materials, viz. print dictionaries. While correcting errors observes the spirit of the first of these essences, since it produces a more useful and practical electronic reference work, it runs counter to the second, in that it simply suppresses information about the original source document.

The TEI (Text Encoding Initiative) DTD addresses a superficially related problem: the treatment of anomalous character data during textual transcription. For example, errors in character data in original sources may be tagged as <sic>, with the correction stored in a corr attribute. Alternatively, the correction may be entered as content in a <corr> element, with the original reading stored in a sic attribute. But despite the superficial similarity between anomalous character data and structural anomaly, SGML does not readily support markup of markup through the use of attributes, and the TEI DTD contains no comparable proposal for wrapping anomalous structures in an umbrella element that would facilitate the specification of two types of markup as markup, one anomalous (as found in the original source) and the other logically correct but philologically unfaithful.
The Loose DTD

The second alternative, creating a loose DTD that does not enforce a strict element order (by, for example, replacing the commas with ampersands in the content model portion of the element declaration above, or making all elements of a content model optional, so as to cater to accidental omissions in the source), is unattractive because it sacrifices structural information. If the evidence of document analysis clearly points to the transgression in a particular place of a structure that otherwise observes a strict order, rather than to conformity to a loose structure that does not require strict order, a DTD based on the latter conceals information about the document, viz. what is regular and what is exceptional. If the DTD is viewed as a formal model of the editor's interpretation of the structure of a document, it is properly a potential object of study in its own right, and suppressing the distinction between the regular and the exceptional distorts the model.

One might wonder, when confronted with a very small number of articles in the dictionary that share some structural peculiarity, whether they are a) a very rare but perfectly valid variation on the standard pattern, b) errors, or c) the sole remaining evidence that the compilers of the original dictionary had a mental model of the dictionary entry that permitted far more variation than they (almost) ever actually used. This question is a particular application of a general problem that arises whenever we transcribe existing print sources. One advantage commonly cited for descriptive over procedural markup is that procedural markup may neutralize and conflate structurally different elements. For example, a printed document may use italics to represent emphasis, book titles, and foreign words, all of which may occur in some of the same environments.
SGML does not prohibit the use of procedural markup tags (for example, one that merely records italic type), but most approaches to encoding such texts for humanities research would assign descriptive tags such as <emph>, <title>, and <foreign> to these pieces of text. In most cases the function of italic type in a specific context in a printed source document is obvious, but apparent ambiguities are possible, and it may not be easy to describe the algorithm a human editor would use to resolve them.

In the problem cited above, assuming a very large volume of consistent material, I would rule out (a) unless I could identify a reason for the dictionary compiler to have deviated deliberately from a pattern followed consistently elsewhere. And I would usually rule out (c) as possibly true but uninteresting, in that it is always possible that the compiler had a mental model that was broader even than anything actually printed in the dictionary. Interpreting typographic features structurally suffers from the same limitations as any historical reconstruction, and historical records are often faulty. Our decision in the present case may come down to a subjective conclusion about whether it is more probable that the compiler had a loose mental model than that he committed a small number of errors.

The Escape Hatch

The third alternative, based on a combination of a structured DTD plus an escape hatch for deviant data, may be the best approach from an engineering perspective. This is the solution adopted by the TEI DTD, which for transcriptions of print dictionaries distinguishes <entry>, a well-structured lexical entry, from <entryfree>, which may contain any dictionary elements in any combination, and therefore caters to lexical entries that violate normal structural constraints.
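The difference between the strict, loose, and escape hatch strategies can be made concrete with a small Python sketch applied to the two entries quoted earlier. Everything here is an assumption made for illustration: the element names (form, pron, gram, variants, etym) are hypothetical, the entries are reduced to the order of their parts, and the checks are toy functions rather than SGML validation.

```python
# Hypothetical, simplified model: (element name, required?) in strict order.
STRICT_MODEL = [
    ("form", True),
    ("pron", True),
    ("gram", False),
    ("variants", False),
    ("etym", True),
]


def matches_strict(children, model=STRICT_MODEL):
    """Strict model: declared order, required items present."""
    pos = 0
    for name, required in model:
        if pos < len(children) and children[pos] == name:
            pos += 1
        elif required:
            return False
    return pos == len(children)


def matches_loose(children, model=STRICT_MODEL):
    """Loose model: same elements in any order, each at most once,
    required items still present."""
    names = [name for name, _ in model]
    required = {name for name, is_required in model if is_required}
    return (all(child in names for child in children)
            and len(set(children)) == len(children)
            and required <= set(children))


def escape_hatch_tag(children):
    """Escape hatch: regular entries are tagged 'entry', all others 'entryfree'."""
    return "entry" if matches_strict(children) else "entryfree"


# Order of parts in the two printed entries (sense text left out).
MAVOURNEEN = ["form", "pron", "variants", "etym"]
MAVERICK = ["form", "pron", "gram", "etym", "variants"]

for label, entry in (("mavourneen", MAVOURNEEN), ("maverick", MAVERICK)):
    print(label, matches_strict(entry), matches_loose(entry), escape_hatch_tag(entry))
```

Under these assumptions the loop prints "mavourneen True True entry" and "maverick False True entryfree". The loose model accepts both entries and so cannot tell the regular from the exceptional, while the escape hatch keeps the distinction only in the name of the wrapping element.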
One limitation of the TEI solution, however, is that it is applicable specifically to print dictionaries, even though these are not the only documents that may contain structural errors, since other types of preexisting documents may also omit or misplace elements through oversight. One might generalize the TEI solution by creating <–free> counterparts for other elements (on the model of <entryfree>), or by creating an element <error>, which may occur anywhere and contain anything, but with this approach the fact that the content of such elements is erroneous, rather than correctly and appropriately unusual or unconstrained, becomes purely a semantic matter, even though by nature the error is syntactic. That is, markup in general is supposed to represent the syntax of the document, but the syntactic error is represented not by syntactically erroneous markup, but by the extrasyntactic meaning assigned to specific elements. Since the principal thing SGML understands is the distinction between syntactic validity and invalidity, is there not something philosophically unsatisfactory about this dislocation?

In Defense of Invalid SGML

The solution that best preserves the integrity of the original source, encodes explicitly the difference between norm and violation of norm, and represents syntactic anomaly in the source as syntactic anomaly in the document instance is the fourth alternative, the creation of an invalid document that violates its DTD. This solution is unavailable and, indeed, almost kinky or subversive in an SGML context because our conditioned perception of SGML as a way of modeling document structure assumes both that documents should be structured and that a document's structure should be represented by its DTD.
The weakness in this perspective is illustrated by the examples cited above from the Oxford English Dictionary: documents created in obvious conformity to a very explicit structure (such as dictionaries), but created without the assistance of SGML tools, may occasionally violate their structure through human error. These violations are informational, at least from an archæographic perspective, and should be preserved. And what should be preserved is not merely that the offending portions observe a different but unremarkable structure, and certainly not that the overall document structure is generally loose. What should be preserved is what document analysis reveals: the document has a highly structured implicit DTD and the offending portions are conceptual violations of this DTD, rather than alternative valid structures.

Although SGML distinguishes valid from invalid syntactic structures, and although documents may contain a smattering of true (authentic, imported from a source and requiring preservation) syntactically invalid structures amid a sea of valid ones, the "escape hatch" solution preserves the distinction only by translating it from the syntactic to the semantic. The requirement that SGML be syntactically valid seems in most contexts so obvious that it would rarely be questioned, but if document analysis of existing documents reveals violations of the structure identified through document analysis, the most appropriate model of this information in SGML terms involves invalid SGML. If we are driven to an "escape hatch" decision because the creation of invalid SGML is foreclosed for practical reasons, we should not lose sight of three things: a) violations of basic structure are syntactic, b) what we are encoding in elements like <entryfree> are the equivalent of anticipated parser error messages, and c) the fact that a document may violate its basic structure in specific places is informational.
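One way to picture violations as information, rather than as faults to be repaired, is a checker that leaves the transcription untouched and reports each deviation from the strict model as data. The Python sketch below is only a thought experiment: the element names and the finding labels (omitted, unexpected, transposed) are hypothetical, and the function looks only at adjacent pairs of elements and ignores repeated elements, so it is far cruder than a real parser.

```python
# Hypothetical, simplified model: (element name, required?) in strict order.
STRICT_MODEL = [
    ("form", True),
    ("pron", True),
    ("gram", False),
    ("variants", False),
    ("etym", True),
]


def diagnose(children, model=STRICT_MODEL):
    """Return a list of findings describing how `children` (a list of
    element names) departs from the strict model. An empty list means
    the entry is regular."""
    order = {name: index for index, (name, _) in enumerate(model)}
    findings = []

    # Required elements that never occur.
    for name, required in model:
        if required and name not in children:
            findings.append(("omitted", name))

    # Elements the model does not know at all.
    known = []
    for child in children:
        if child in order:
            known.append(child)
        else:
            findings.append(("unexpected", child))

    # Adjacent known elements that occur in the reverse of the model order.
    for first, second in zip(known, known[1:]):
        if order[first] > order[second]:
            findings.append(("transposed", first, second))

    return findings


MAVOURNEEN = ["form", "pron", "variants", "etym"]
MAVERICK = ["form", "pron", "gram", "etym", "variants"]

print(diagnose(MAVOURNEEN))
print(diagnose(MAVERICK))
```

With these assumptions the regular entry yields an empty list and the maverick entry yields a single transposed finding for etym and variants. The entry itself is not altered, and the record of the deviation travels alongside it.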
The Role of the SGML Parser

The preceding three thoughts are supported by considerations rooted in questions of the philosophy of markup, but from an engineering perspective they pose several obvious practical obstacles, derived from the inability of current SGML software to deal productively with violations within a document instance of the syntax specified by the associated DTD. Among these problems:

1. Parser error messages are not standardized, which complicates the automated processing of such messages across systems.
2. The type of intentional error described above is not easily distinguished from unintentional errors that the user will not wish to preserve or process, and that should simply be corrected when reported during document preparation.
3. Parser error messages may not identify the exact syntactic nature of the error in a useful way. For example, a parser may be unable to distinguish a transposition from a combination of omission and insertion.

SGML parsers may be able to recognize some types of errors well enough to generate informative error messages, recognizing, in effect, both a correct grammar that raises no errors and a looser grammar that includes a number of incorrect constructions that the parser can nonetheless identify. This raises the possibility of using the notion of multiple grammars to enrich parser output, so that a document might, for example, be evaluated not only as correct or incorrect, but as correct or incorrect in a variety of specific ways. Elements that have been omitted or included erroneously are already recognized by such parsers as nsgmls; recognizing elements that are misplaced as dislocations, rather than unexpected omissions combined with unexpected insertions, may prove more difficult.

In an authoring environment enriched in this way, the system might query the user upon encountering a parsing error for the first time.
The user would either correct the error or inform the system of how such erroneous structures should be mapped automatically in the future to valid structures. The interjection of this type of associative layer into the model allows the document instance to preserve the syntax of the original, it allows the DTD to model the abstract structure underlying the original even when that structure is not followed with absolute fidelity, and it provides a format where users can specify in formal ways the relationships between ideal and actual markup without compromising the integrity of either the transcription of the primary source or the DTD that purports to represent the syntactic structure underlying that source.

Acknowledgements

An earlier version of this paper was presented under the title In Defense of Invalid SGML at the 1997 annual joint meeting of the Association for Computers and the Humanities and the Association for Literary and Linguistic Computing (ACH/ALLC) at Queen's University, Kingston, Ontario. I am grateful to John Lavagnino, David Mundie, Michael Sperberg-McQueen, and Frank Tompa for comments on that earlier text.

Bibliography

Berg. Donna Lee Berg. The Research Potential of the Electronic OED2 Database at the University of Waterloo: A Guide for Scholars. Research Report OED-89-02. Waterloo: UW Centre for the New Oxford English Dictionary and Text Research. 1989. See especially the revised electronic publication based on this work at http://www.epas.utoronto.ca:8080/cch/Berg/Berg-1_Contents.html

Johnson. Samuel Johnson. A Dictionary of the English Language. Oxford. 1755.

OED. Oxford English Dictionary. Oxford: Clarendon Press. 1884–1928. (Published in fascicles. First edition published in bound volumes in 1933 and followed by four supplements 1972–86. Second edition 1989.)

TEI. Guidelines for Electronic Text Encoding and Interchange (TEI P3). Chicago and Oxford: Text Encoding Initiative. 1994.

Winchester. Simon Winchester. The Professor and the Madman. New York: HarperCollins. 1998.

Young-Lai. Matthew Young-Lai. Application of a Stochastic Grammatical Inference Method to Text Structure. Technical Report CS-96-36. Waterloo: Department of Computer Science, University of Waterloo. 1996.