The Problem of Anomalous Data
David J. Birnbaum
David J. Birnbaum is Associate Professor and Chair of
the Department of Slavic Languages and Literatures at
the University of Pittsburgh. His research in electronic
text technology is concentrated primarily on problems
of encoding and processing medieval Slavic manuscript
materials.
Keywords: SGML, DTD, anomalous data, legacy documents,
validation
Abstract: SGML was developed primarily for encoding
new texts, and for ensuring that the structure of these
texts conforms to a grammar specified in a DTD. The
use of SGML to produce new electronic editions of existing
print documents, a common operation in humanities computing,
raises a problem that is not present when one creates
new electronic documents that have no ancestors: existing
paper texts may violate their underlying structures
because of human error during their compilation or production,
while their SGML counterparts are expected to conform
to the structure specified in a DTD. There are various
workable engineering solutions to this problem, but
the availability of these strategies should not obscure
the philosophical problem of encoding what is in essence
an invalid document within a framework designed specifically
to support validity.
Background
SGML (Standard Generalized Markup Language, ISO 8879:1986)
was developed primarily for encoding new texts, an environment
in which the rigorous adherence to a DTD (Document Type
Definition) ensures that the resulting documents will
observe a coherent structure. Consider, for example,
the Oxford English Dictionary, which is divided into,
among other things, lexical entries, each of which consists
of the following information (described in the introduction
to the dictionary itself):
1. identification
a. main form (required)
b. pronunciation (required; within parentheses)
c. grammatical designation (optional; omission means
substantive)
d. specification (optional; e.g. music, biology)
e. status (optional; e.g., obsolete, archaic, colloquial,
dialect)
f. earlier forms (optional)
g. inflections (optional)
2. morphology
a. etymology (required)
b. subsequent history (optional)
c. miscellaneous comments (optional)
3. signification (meaning, subdivided into multiple
hierarchies that trace the development of different
meanings, with illustrative quotations for each)
The following is a simplified partial illustration:
"Chaser2 (phonetic transcription in parentheses omitted
here due to typographic limitations) [f. CHASE v.2 + -ER]
1 One who chases or engraves metal. 1707 EARL BINDON
in Lond. Gaz. No. 4339/3 Engravers, Carvers, Chacers.
1762-71 H. WALPOLE Vertue's Anecd. Paint. (1786) I.
153 Enamellers and chasers of plate."
One way to represent the structure of dictionary entries
of this type (partially, and with considerable simplification)
would be the following (assume that the content of all
undefined elements is #PCDATA):
See Berg concerning the conversion of the printed Oxford
English Dictionary to electronic form and Young-Lai
concerning one strategy for inferring a DTD from a non-SGML
document with in-line tags.
The text of the dictionary excerpt cited above might
be marked up in conformity with this DTD fragment (with
some simplification) as follows:
Chaser
(phonetic transcription; omitted here due to typographic limitations)
f. CHASE v.2 + -ER
One who chases or engraves metal.
1707. Earl Bindon in Lond. Gaz. No. 4339/3. Engravers, Carvers, Chacers.
1762–71. H. Walpole Vertue's Anecd. Paint. (1786) I. 153 Enamellers and chasers of plate.
Users who create new dictionaries based on this DTD
will be required by their SGML editing and validating
tools to follow the specified structure. The model provides
some flexibility, so that, for example, authors may
include or omit an indication of the lexical status
of a particular entry. But this flexibility is restricted,
so that, for example, if a status element is included,
it must follow, rather than precede, the obligatory
element that precedes it in the model. SGML software is
not, of course, able
to ensure that the author actually enters status information
into the status element, but the software can, at
least, verify that the status element itself occurs
only in a legal environment. An SGML environment thus
protects users from inadvertently creating syntactically
contradictory documents.
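The kind of check described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the element names (form, pron, gram, spec, status) and the content model are hypothetical stand-ins for the identification part of an entry, not the DTD discussed in this paper, and a regular expression stands in for a real SGML parser.

```python
import re

# Hypothetical content model, in DTD-like terms:
#   (form, pron, gram?, spec?, status?)
# Element names are illustrative assumptions, not taken from a real DTD.
CONTENT_MODEL = re.compile(r"form pron( gram)?( spec)?( status)?")


def is_valid(children):
    """Return True if the sequence of child element names fits the model."""
    return CONTENT_MODEL.fullmatch(" ".join(children)) is not None


# Optional elements may be present or absent ...
print(is_valid(["form", "pron"]))
print(is_valid(["form", "pron", "status"]))
# ... but they may not appear out of position.
print(is_valid(["form", "status", "pron"]))
```

The check looks only at which elements occur and in what order; it never inspects what the author typed inside them.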
A Historical Interlude
Samuel Johnson, author of the 1755 Dictionary of the
English Language, the first dictionary of any significance
that attempted to document not just rare or unusual
English words, but the English language in general,
waggishly defined a lexicographer as "a harmless drudge."
Those who have read Simon Winchester's recently-published
study entitled The Professor and the Madman know better;
as we shall see below, lexicographers may sometimes
be harmful drudges, as well.
The great organizational philosophy underlying the Oxford
English Dictionary was that the dictionary would be
based on historical principles, which meant, among other
things, that the history of words and their meanings
would be determined by and documented with citations
from the entire history of writing in the English language.
This quest seems almost quixotic for a time before computers,
and the raw citations from which the Oxford English
Dictionary was constructed were assembled manually by
an army of volunteer readers, who combed selected works
for quotations that would illustrate the use or meaning
of every word in the language.
As Simon Winchester tells us (the following summary of
Minor's history is paraphrased from Winchester's recent
book), one of the most prolific of these readers was Dr. William
Chester Minor, a retired U.S. Army surgeon who had served
in the Union army during the American Civil War, and
who had then moved to Lambeth, then outside London,
after his retirement from the army. Dr. Minor contributed
tens of thousands of citations to the Oxford English
Dictionary project, more than almost any other reader,
but the most unusual aspect of his participation in
the dictionary project was that his contributions were
prepared in and submitted from his cell in the Broadmoor
Criminal Lunatic Asylum in rural Berkshire. Dr. Minor's
forced retirement from the U.S. Army had been for reasons
of mental illness, and his insanity had included, among
other things, a paranoid fear of the Irish. Late one
night in 1871, shortly after arriving in England, Dr.
Minor had rushed out onto the street in Lambeth, where
he shot and killed an innocent English laborer under
the delusion that the man was part of an Irish conspiracy
to break into his rooms at night and molest him. In
keeping with British justice, Dr. Minor was found not
guilty by reason of insanity and sentenced to be "detained
in safe custody until Her Majesty's Pleasure be known."
In fact, his incarceration was to last most of his life,
and he was transferred to a Stateside hospital, where
he could be visited by his relatives, only in extreme
old age.
I have related a bit of Dr. Minor's history to demonstrate
that the Oxford English Dictionary has long attracted
the attention of, shall we say, eccentrics. In my own
rather more harmless case, I shall argue that the dictionary
illustrates a type of encoding problem that may not
have been anticipated when SGML was developed, and for
which invalid SGML of the sort that generates parser
error messages may be the most appropriate way to represent
the information in question.
The Problem
In a standard SGML authoring model, SGML tools can ensure
that newly-created documents conform to a DTD developed
by the author. This model is appropriate in an environment
where SGML tools are used to create structured documents,
but it is somewhat less well suited to the production
of electronic versions of preexisting print (or even
non-SGML electronic) documents, an extremely common
enterprise in humanities computing. Such transcriptions
are problematic because preexisting documents that were
created outside SGML editors may, owing to the fallibility
of human editors, violate the overall logical structure
of those documents. For example, an isolated dictionary
entry might improperly omit an obligatory element, or
place it out of an otherwise strict and regular position.
This type of error actually occurs in the second edition
of the Oxford English Dictionary. Variants traditionally
precede etymology, but for maverick the etymology precedes
the variants. Compare the entries for mavourneen (regular)
and maverick (anomalous):
"maverick (pronunciation omitted for typographic reasons),
sb. [Samuel A. Maverick (1803–1870), a Texas cattle-owner
who left the calves of his herd unbranded.] Also mavorick."
"mavourneen (pronunciation omitted for typographic reasons).
Also 9 mavournin. [Irish mo mhurnín.] My darling."
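The difference between the two entries can be made concrete with a small sketch. The canonical order and the element names below are assumptions introduced for illustration (variants before etymology, as the text states); the two sequences transcribe only the order of the parts visible in the quotations above.

```python
# Assumed canonical order of parts; the names are hypothetical.
ORDER = ["form", "pron", "gram", "variants", "etym", "sense"]


def out_of_order(children):
    """Return elements that occur after an element they should precede."""
    rank = {name: i for i, name in enumerate(ORDER)}
    offenders = []
    highest = -1
    for name in children:
        if rank[name] < highest:
            offenders.append(name)
        else:
            highest = rank[name]
    return offenders


# Order of parts as they appear in the two printed entries.
mavourneen = ["form", "pron", "variants", "etym", "sense"]
maverick = ["form", "pron", "gram", "etym", "variants"]

print(out_of_order(mavourneen))  # []
print(out_of_order(maverick))    # ['variants']
```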
This sort of error leaves the editor of the electronic
edition with several unattractive choices, including:
1. "Correct" the original text during transcription;
2. Create a loose DTD, which does not enforce the presence
or order of elements strictly;
3. Create a strict DTD, but incorporate an escape hatch
structure, which treats deviations as grammatically
valid alternatives;
4. Create an invalid document that violates its DTD.
The first three strategies are compatible with SGML
processing: all three yield a DTD and document instance
that can be processed with standard SGML tools. The
fourth, on the other hand, yields a result that generates
parser error messages and results that at best are undefined
and at worst are crippling. The interesting issues,
then, are of two types: those that distinguish the three
SGML-compatible solutions from one another and those
that support the SGML-incompatible solution.
It may be worth recalling at this point that it is not
the responsibility of an SGML document-processing environment
to identify all types of logical errors. For example:
1. As noted above, SGML software has no way to verify
whether a user has entered pronunciation information
correctly inside the pronunciation element and etymological
information inside the etymology element, rather than switching
them around. SGML is concerned primarily with document
structure, and (except for the special treatment of
certain SGML markup characters) SGML's interest in the
specific bytes that make up each piece of PCDATA content
is limited to verifying that all character data contains
only legal characters.
2. Even if pronunciation information has been entered
in the pronunciation element and etymological information in
the etymology element, SGML software has no way to check
whether this information has been entered without error.
As in the preceding case, SGML may be concerned with
where PCDATA may occur within a document, but it is
indifferent to the makeup of each instance of PCDATA
itself.
3. Not only does SGML software not examine the particular
character data that occurs in a specific PCDATA environment,
but it has no way of monitoring whether a PCDATA location
contains any data at all. Because a requirement for
PCDATA in a particular location may be satisfied by
zero data characters, a user who includes an etymology
element but forgets to enter the textual content will
not be notified of any error by an SGML application.
4. Finally, SGML is not concerned with the semantic
appropriateness of generic identifiers (the names of
elements). To use the pronunciation label systematically to
tag etymological information and the etymology label systematically
to tag pronunciation information would be monumentally
confusing to a human, but it would not be an SGML error,
because generic identifiers are merely arbitrary signs
as far as an SGML system is concerned.
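The four cases above share one property: a purely structural check sees element names and their order, and nothing else. The following sketch illustrates this under assumptions: the three-element model and the names form, pron, and etym are hypothetical, and the text values are placeholders rather than real dictionary content.

```python
import re

# Assumed model: (form, pron, etym). Names are hypothetical.
MODEL = re.compile(r"form pron etym")


def structurally_valid(entry):
    """entry is a list of (element_name, text) pairs; only names are checked."""
    names = " ".join(name for name, _text in entry)
    return MODEL.fullmatch(names) is not None


correct = [("form", "maverick"),
           ("pron", "PRONUNCIATION TEXT"),
           ("etym", "ETYMOLOGY TEXT")]
# Content of two elements switched around (cases 1 and 4 above).
swapped = [("form", "maverick"),
           ("pron", "ETYMOLOGY TEXT"),
           ("etym", "PRONUNCIATION TEXT")]
# Element present, text forgotten (case 3 above).
empty = [("form", "maverick"),
         ("pron", "PRONUNCIATION TEXT"),
         ("etym", "")]

for entry in (correct, swapped, empty):
    print(structurally_valid(entry))  # True each time
```

All three entries pass, because none of these mistakes changes which elements occur or where.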
These SGML-irrelevant errors invite us to ask what sorts
of errors an SGML system should monitor, and the obvious
answer is that SGML is concerned with the syntax of
documents, and, specifically, with verifying that the
syntax of a document instance obeys the syntactic rules
established in an associated DTD. Deviations from the
DTD within the document instance constitute SGML errors,
and must be reported as such by a validating SGML parser.
Those who work with SGML texts are used to thinking
of valid SGML as the inevitable goal of our encoding
projects, and we assume that error messages are generated
by a parser to alert us to the presence of faulty data,
which we then normally repair before publication or
further processing. We are not conditioned to think
of syntactically invalid SGML as a natural or desirable
state, or as a practical or appropriate way of representing
syntactically contradictory source data.
Practical Solutions
Assumptions
My evaluation of the various practical solutions to
dealing with anomalies in source documents is based
on two assumptions. First, I assume that where the meaning
or function of character data is unambiguous, that data
should be transcribed faithfully. Editorial comments
about errors in the original source document are not
excluded, but for reasons discussed below, these should
be restricted to markup, rather than introduced by altering
the character data content of the document. Second,
I assume that a DTD is an interpretive statement, rather
than just an engineering convenience. A DTD represents the
editor's analysis of the structure of a document, as
inferred from document analysis. If we find errors in
our source document, a faithful representation of the
source document would show a structure with violations,
because to represent the document as structurally consistent
would contradict the conclusions implied by document
analysis.
Editorial Correction
The first of the solutions noted above, changing text
during transcription to correct errors in an original
source document, is unattractive because transcriptions
of (to continue the original example) existing print
dictionaries have two essences: they are new and functional
electronic dictionaries and they are electronic records
of existing archæographic materials, viz. print dictionaries.
While correcting errors observes the spirit of the first
of these essences, since it produces a more useful and
practical electronic reference work, it runs counter
to the second, in that it simply suppresses information
about the original source document.
The TEI (Text Encoding Initiative) DTD addresses a superficially
related problem: the treatment of anomalous character
data during textual transcription. For example, errors
in character data in original sources may be tagged
as <sic>, with the correction stored in a corr attribute.
Alternatively, the correction may be entered as content
in a <corr> element, with the original reading stored
in a sic attribute. But despite the superficial similarity
between anomalous character data and structural anomaly,
SGML does not readily support markup of markup through
the use of attributes, and the TEI DTD contains no comparable
proposal for wrapping anomalous structures in an umbrella
element that would facilitate the specification of two
types of markup as markup, one anomalous (as found in
the original source) and the other logically correct
but philologically unfaithful.
The Loose DTD
The second alternative, creating a loose DTD that does
not enforce a strict element order (by, for example,
replacing the commas with ampersands in the content
model portion of the element declaration above, or making
all elements of a content model optional, so as to cater
to accidental omissions in the source), is unattractive
because it sacrifices structural information. If the
evidence of document analysis clearly points to the
transgression in a particular place of a structure that
otherwise observes a strict order, rather than to conformity
to a loose structure that does not require strict order,
a DTD based on the latter conceals information about
the document, viz. what is regular and what is exceptional.
If the DTD is viewed as a formal model of the editor's
interpretation of the structure of a document, it is
properly a potential object of study in its own right,
and suppressing the distinction between the regular
and the exceptional distorts the model.
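What the loose DTD gives up can be shown with a short sketch. It rests on simplifying assumptions: the element names and canonical order are hypothetical, every element is treated as optional, and the two functions only approximate the SGML comma (sequence) and ampersand (any order) connectors.

```python
# Assumed canonical order; the names are hypothetical.
ORDER = ["form", "pron", "gram", "variants", "etym", "sense"]


def strict_ok(children):
    """Comma connector: known elements, each at most once, in order."""
    if not set(children) <= set(ORDER):
        return False
    ranks = [ORDER.index(c) for c in children]
    return all(a < b for a, b in zip(ranks, ranks[1:]))


def loose_ok(children):
    """Ampersand connector: known elements, each at most once, any order."""
    return set(children) <= set(ORDER) and len(set(children)) == len(children)


regular = ["form", "pron", "variants", "etym", "sense"]
anomalous = ["form", "pron", "gram", "etym", "variants"]

print(strict_ok(regular), strict_ok(anomalous))  # True False
print(loose_ok(regular), loose_ok(anomalous))    # True True
```

Under the loose model both sequences are simply valid, so nothing in the result records which of them is the norm and which is the exception.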
One might wonder, when confronted with a very small
number of articles in the dictionary that share some
structural peculiarity, whether they are a) a very rare
but perfectly valid variation on the standard pattern,
b) errors, or c) the sole remaining evidence that the
compilers of the original dictionary had a mental model
of the dictionary entry that permitted far more variation
than they (almost) ever actually used?
This question is a particular application of a general
problem that arises whenever we transcribe existing
print sources. One advantage commonly cited for descriptive
over procedural markup is that procedural markup may
neutralize and conflate structurally different elements.
For example, a printed document may use italics to represent
emphasis, book titles, and foreign words, all of which
may occur in some of the same environments. SGML does
not prohibit the use of a procedural markup tag for
italics, but most approaches to encoding such texts
for humanities research would assign descriptive tags
for emphasis, titles, and foreign words to these pieces
of text. In most cases the function of italic type in
a specific context in a printed source document is obvious,
but apparent ambiguities are possible, and it may not
be easy to describe the algorithm a human editor would
use to resolve them.
In the problem cited above, assuming a very large volume
of consistent material, I would rule out (a) unless
I could identify a reason for the dictionary compiler
to have deviated deliberately from a pattern followed
consistently elsewhere. And I would usually rule out
(c) as possibly true but uninteresting, in that it is
always possible that the compiler had a mental model
that was broader even than anything actually printed
in the dictionary. Interpreting typographic features
structurally suffers from the same limitations as any
historical reconstruction, and historical records are
often faulty. Our decision in the present case may come
down to a subjective conclusion about whether it is
more probable that the compiler had a loose mental model
than that he committed a small number of errors.
The Escape Hatch
The third alternative, based on a combination of a structured
DTD plus an escape hatch for deviant data, may be the
best approach from an engineering perspective. This
is the solution adopted by the TEI DTD, which for transcriptions
of print dictionaries distinguishes <entry>, a well-structured
lexical entry, from <entryFree>, which may contain any
dictionary elements in any combination, and therefore
caters to lexical entries that violate normal structural
constraints. One limitation of the TEI solution, however,
is that it is applicable specifically to print dictionaries,
even though these are not the only documents that may
contain structural errors, since other types of preexisting
documents may also omit or misplace elements through
oversight. One might generalize the TEI solution by
creating free-form counterparts for other elements, or
by creating a general escape element, which may occur anywhere
and contain anything, but with this approach the fact
that the content of such elements is erroneous, rather
than correctly and appropriately unusual or unconstrained,
becomes purely a semantic matter, even though by nature
the error is syntactic. That is, markup in general is
supposed to represent the syntax of the document, but
the syntactic error is represented not by syntactically
erroneous markup, but by the extrasyntactic meaning
assigned to specific elements. Since the principal thing
SGML understands is the distinction between syntactic
validity and invalidity, is there not something philosophically
unsatisfactory about this dislocation?
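The dislocation can be seen in a minimal sketch of an escape-hatch model. The element names entry and entryFree follow the TEI usage mentioned above; the child element names, the canonical order, and the validation logic are assumptions made for illustration, not the TEI content models themselves.

```python
# Assumed canonical order of child elements; the names are hypothetical.
CANONICAL = ["form", "pron", "gram", "variants", "etym", "sense"]


def valid(element, children):
    """Validate children against the model assumed for `element`."""
    if not set(children) <= set(CANONICAL):
        return False
    if element == "entryFree":
        # Escape hatch: any known elements, in any order.
        return True
    if element == "entry":
        # Strict model: known elements in canonical order.
        ranks = [CANONICAL.index(c) for c in children]
        return all(a < b for a, b in zip(ranks, ranks[1:]))
    return False


maverick = ["form", "pron", "gram", "etym", "variants"]

print(valid("entry", maverick))      # False
print(valid("entryFree", maverick))  # True
```

Once the anomalous entry is tagged with the escape-hatch element, the validator reports nothing at all. That the entry is an error is known only by the convention attached to the element name.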
In Defense of Invalid SGML
The solution that best preserves the integrity of the
original source, encodes explicitly the difference between
norm and violation of norm, and represents syntactic
anomaly in the source as syntactic anomaly in the document
instance is the fourth alternative, the creation of
an invalid document that violates its DTD. This solution
is unavailable and, indeed, almost kinky or subversive
in an SGML context because our conditioned perception
of SGML as a way of modeling document structure assumes
both that documents should be structured and that a
document's structure should be represented by its DTD.
The weakness in this perspective is illustrated by the
examples cited above from the Oxford English Dictionary:
documents created in obvious conformity to a very explicit
structure (such as dictionaries), but created without
the assistance of SGML tools, may occasionally violate
their structure through human error. These violations
are informational, at least from an archæographic perspective,
and should be preserved. And what should be preserved
is not merely that the offending portions observe a
different but unremarkable structure, and certainly
not that the overall document structure is generally
loose. What should be preserved is what document analysis
reveals: the document has a highly-structured implicit
DTD and the offending portions are conceptual violations
of this DTD, rather than alternative valid structures.
Although SGML distinguishes valid from invalid syntactic
structures, and although documents may contain a smattering
of true (authentic, imported from a source and requiring
preservation) syntactically invalid structures amid
a sea of valid ones, the "escape hatch" solution preserves
the distinction only by translating it from the syntactic
to the semantic.
The requirement that SGML be syntactically valid seems
in most contexts so obvious that it would rarely be
questioned, but if document analysis of existing documents
reveals violations of the structure identified through
document analysis, the most appropriate model of this
information in SGML terms involves invalid SGML. If
we are driven to an "escape hatch" decision because
the creation of invalid SGML is foreclosed for practical
reasons, we should not lose sight of three things: a)
violations of basic structure are syntactic, b) what
we are encoding in elements like <entryFree> are the
equivalent of anticipated parser error messages, and
c) the fact that a document may violate its basic structure
in specific places is informational.
The Role of the SGML Parser
The preceding three thoughts are supported by considerations
rooted in questions of the philosophy of markup, but
from an engineering perspective they pose several obvious
practical obstacles, derived from the inability of current
SGML software to deal productively with violations within
a document instance of the syntax specified by the associated
DTD. Among these problems:
1. Parser error messages are not standardized, which
complicates the automated processing of such messages
across systems.
2. The type of intentional error described above is
not easily distinguished from unintentional errors that
the user will not wish to preserve or process, and that
should simply be corrected when reported during document
preparation.
3. Parser error messages may not identify the exact
syntactic nature of the error in a useful way. For example,
a parser may be unable to distinguish a transposition
from a combination of omission and insertion.
SGML parsers may be able to recognize some types of
errors well enough to generate informative error messages,
recognizing, in effect, both a correct grammar that
raises no errors and a looser grammar that includes
a number of incorrect constructions that the parser
can nonetheless identify. This raises the possibility
of using the notion of multiple grammars to enrich parser
output, so that a document might, for example, be evaluated
not only as correct or incorrect, but as correct or
incorrect in a variety of specific ways. Elements that
have been omitted or included erroneously are already
recognized by such parsers as nsgmls; recognizing elements
that are misplaced as dislocations, rather than unexpected
omissions combined with unexpected insertions, may prove
more difficult.
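One way to picture the richer diagnosis suggested here is a function that compares the sequence found with the sequence expected. This is a sketch under assumptions, not a description of how nsgmls or any other parser works: the element names are hypothetical, every expected element is treated as required, and each element is assumed to occur at most once.

```python
def classify(actual, expected):
    """Describe how `actual` deviates from the `expected` element order."""
    findings = []
    findings += [("omission", e) for e in expected if e not in actual]
    findings += [("insertion", a) for a in actual if a not in expected]
    # Compare the relative order of the elements both sequences share.
    common_actual = [a for a in actual if a in expected]
    common_expected = [e for e in expected if e in actual]
    if common_actual != common_expected:
        moved = tuple(a for a, e in zip(common_actual, common_expected)
                      if a != e)
        findings.append(("transposition", moved))
    return findings


expected = ["form", "pron", "gram", "variants", "etym"]
maverick = ["form", "pron", "gram", "etym", "variants"]

print(classify(maverick, expected))
# [('transposition', ('etym', 'variants'))]
```

A report of this kind says that two elements changed places, whereas a report built only from omissions and insertions would describe the same entry as missing one element and containing another unexpectedly.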
In an authoring environment enriched in this way,
the system might query the user upon encountering a
parsing error for the first time. The user would either
correct the error or inform the system of how such erroneous
structures should be mapped automatically in the future
to valid structures. The interjection of this type of
associative layer into the model allows the document
instance to preserve the syntax of the original, it
allows the DTD to model the abstract structure underlying
the original even when that structure is not followed
with absolute fidelity, and it provides a format where
users can specify in formal ways the relationships between
ideal and actual markup without compromising the integrity
of either the transcription of the primary source or
the DTD that purports to represent the syntactic structure
underlying that source.
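A rough sketch of such an associative layer follows. Everything in it is hypothetical: the class name, the element names, and the rule format are assumptions introduced for illustration, and the exception stands in for the point at which a real system would query the user.

```python
# Assumed canonical order; the names are hypothetical.
ORDER = ["form", "pron", "gram", "variants", "etym", "sense"]


def is_valid(children):
    """True if the children are known elements in canonical order."""
    if not set(children) <= set(ORDER):
        return False
    ranks = [ORDER.index(c) for c in children]
    return all(a < b for a, b in zip(ranks, ranks[1:]))


class MappingLayer:
    """Maps anomalous structures to valid ones without altering the source."""

    def __init__(self):
        self.rules = {}

    def register(self, anomalous, normalized):
        """Record the user's decision about one anomalous structure."""
        if not is_valid(normalized):
            raise ValueError("target structure must itself be valid")
        self.rules[tuple(anomalous)] = tuple(normalized)

    def resolve(self, children):
        """Return a valid structure for processing; the input is unchanged."""
        if is_valid(children):
            return list(children)
        key = tuple(children)
        if key not in self.rules:
            # A real system would query the user at this point.
            raise LookupError("unmapped anomaly: %r" % (key,))
        return list(self.rules[key])


layer = MappingLayer()
maverick = ["form", "pron", "gram", "etym", "variants"]
layer.register(maverick, ["form", "pron", "gram", "variants", "etym"])
print(layer.resolve(maverick))
```

The transcription keeps the order found in the source, the strict model stays strict, and the relationship between the two is recorded separately from both.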
Acknowledgements
An earlier version of this paper was presented under
the title In Defense of Invalid SGML at the 1997 annual
joint meeting of the Association for Computers and the
Humanities and the Association for Literary and Linguistic
Computing (ACH/ALLC) at Queen's University, Kingston, Ontario.
I am grateful to John Lavagnino, David Mundie, Michael
Sperberg-McQueen, and Frank Tompa for comments on that
earlier text.
Bibliography
Berg. Donna Lee Berg. 1989. The Research Potential of
the Electronic OED2 Database at the University of Waterloo:
A Guide for Scholars. Research Report OED-89-02. Waterloo:
UW Centre for the New Oxford English Dictionary and
Text Research. See especially the revised electronic
publication based on this work at http://www.epas.utoronto.ca:8080/cch/Berg/Berg-1_Contents.html
Johnson. Samuel Johnson. A Dictionary of the English
Language. Oxford. 1755.
OED. Oxford English Dictionary, Oxford: Clarendon Press,
1884–1928. (Published in fascicles. First edition published
in bound volumes in 1933 and followed by four supplements
1972–86. Second edition 1989.)
TEI. Guidelines for Electronic Text Encoding and Interchange
(TEI P3). Chicago and Oxford: Text Encoding Initiative.
1994.
Winchester. Simon Winchester. The Professor and the
Madman. New York: HarperCollins. 1998.
Young. Matthew Young-Lai. Application of a Stochastic
Grammatical Inference Method to Text Structure. Technical
Report CS-96-36. Waterloo: Department of Computer Science,
University of Waterloo. 1996.