Serving Non-Latin1 Web Documents in an Eight-Bit World

David J. Birnbaum (djb@clover.slavic.pitt.edu)
Yulia Chugunova

Copyright © 1996 by David J. Birnbaum and Yulia Chugunova.
All rights reserved.

Abstract

HTML supports only the ISO 8859-1: 1987 character set (sometimes identified informally as ISOLatin1).[1] In the absence of any standard protocol for serving documents that do not conform to the ISOLatin1 character set on the World Wide Web, a variety of improvisations has been developed by both commercial concerns and communities of users. The present report surveys these systems, discusses their advantages and disadvantages, and then describes a new strategy that makes non-ISOLatin1 documents accessible to a larger community of users than any alternative currently available.

Introduction

The World Wide Web is an effective environment for platform-independent document interchange because it is based on a client/server model in which both client and server are constrained by open standards. (add in main text: who is the w3 consortium and what's available on their servers?) [2]

A document-interchange model based on public standards provides obvious advantages in an environment where unknown servers must communicate with unknown clients: as long as both server and client adhere completely to the HTTP and HTML standards, effective communication is ensured. Deviation from standards on the part of either server or browser, on the other hand, compromises the integrity of the communication; in order for all information to be transmitted successfully, a browser must be able to parse all conformant HTML markup and a server should not deliver non-HTML markup that a browser is unable to parse.[3]

A standards-based client/server model necessarily imposes limitations, some of which may be unacceptable to users in certain environments. The HTML standard undergoes constant review and revision in an attempt to address inadequacies, but maintaining a standard involves delays, and users may find themselves forced to choose between implementing a non-standard solution to a problem not addressed in the standards or seceding from the HTML community. For example, the character set supported by the HTML 2.0 standard is ISO 8859-1, known as ISOLatin1, which contains the characters required to represent most western European writing systems. [4] The HTML standard does not support eastern European and other Latin-alphabet writing systems that require characters not included in ISOLatin1, and it also does not support writing systems based on non-Latin alphabets.

The writing systems of the world are not all represented by ISOLatin1, and the name "World Wide Web" is, in a certain sense, contradicted by the lack of support for character sets other than ISOLatin1 in HTML; the whole world is able to participate over the world-wide Internet, but only in a subset of the world's languages. This lack of support for many national writing systems has compelled users to develop non-standard ways to represent writing systems not based on ISOLatin1, subverting, in the process, both the client/server model underlying HTTP and the useful distinction between character and glyph. The present report discusses the strengths and weaknesses of current user solutions to the lack of Cyrillic support in the HTML standard, outlines how a technically adequate solution could be implemented in an eventual Unicode-based revision of the HTML standard, and then describes a system that has been implemented in the current eight-bit environment to improve the accessibility of Cyrillic documents on the World Wide Web.

Current Solutions

The predominant character-coding system used for Russian materials on the World Wide Web is a Russian national standard known as KOI8. [5] KOI8 is an eight-bit character set that includes ASCII (= ISO/IEC 646 IRV ADD YEAR) in the lower 128 cells and modern Slavic Cyrillic in the upper 128.[6] Most World Wide Web browsers can be configured to use KOI8 fonts, and users who have installed this support will see Cyrillic glyphs in their display even though their browser "thinks" it is representing accented ISOLatin1 characters.
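The difference between what the browser "thinks" and what the user sees can be illustrated with a short Python sketch. It assumes that KOI8 behaves like the KOI8-R variant provided by Python's koi8_r codec, and the sample word is our own choice for illustration.

```python
# Sketch: the same bytes seen by an ISOLatin1 browser and by a KOI8 font.
# Assumption: "KOI8" is taken to be the KOI8-R variant (Python's "koi8_r").

cyrillic = "мир"                         # illustrative sample word
wire_bytes = cyrillic.encode("koi8_r")   # bytes as delivered by the server

# Every Cyrillic letter occupies one of the upper 128 cells (0x80-0xFF).
assert all(b >= 0x80 for b in wire_bytes)

# An unmodified browser interprets those cells as ISOLatin1 characters.
as_latin1 = wire_bytes.decode("latin-1")

# A browser with a KOI8 font draws Cyrillic glyphs for the very same cells.
as_koi8 = wire_bytes.decode("koi8_r")

print(" ".join(f"{b:02x}" for b in wire_bytes))
print("ISOLatin1 interpretation:", as_latin1)
print("KOI8 interpretation:     ", as_koi8)
```

The bytes on the wire are identical in both cases; only the table used to interpret them differs, which is why the display depends entirely on the browser's configuration.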

This transparent Cyrillic support is achieved at significant practical and theoretical expense:

  1. The standard-based client/server architecture underlying the World Wide Web means that the character set supported by the HTML standard must be supported by both the server and the browser, so that a user with a standard, unmodified browser is guaranteed a legible display of any HTML-conformant text. Where Cyrillic information is transmitted in KOI8 encoding, only users with (non-HTML-conformant) KOI8 support on their browsers will be able to access the resources effectively. In other words, the use of KOI8 encoding guarantees that users with fully conformant but unmodified World Wide Web browsers will not be able to render Cyrillic documents legibly.
  2. The HTML standard assumes that fonts will be used for typographic purposes, and not for character set support; regardless of the font selected by the user, HTML expects that the entire ISOLatin1 character set will be supported. The KOI8 character set supports ASCII plus modern Slavic Cyrillic, which means that users who require only modern Slavic Cyrillic writing systems plus English can install KOI8 fonts in their browsers and display all alphabetic glyphs used in these writing systems. But users who require both accented Latin-alphabet glyphs and Cyrillic are out of luck, because Cyrillic in KOI8 replaces the accented Latin characters of ISOLatin1, which means that a document that uses only standard HTML-conformant tags cannot render accented Latin-alphabet and Cyrillic materials simultaneously.
  3. HTML relies on a specific font encoding vector for non-alphabetic characters. For example, users who install KOI8 fonts often discover that what should be displayed as a bullet in an unordered list (<UL>) is instead displayed as a Ukrainian Cyrillic letter. Ideally, Cyrillic support should be implemented in a way that does not interfere with other system resources, including non-alphabetic resources.
  4. A KOI8 font in a web browser environment subverts the standard use of SGML entities. HTML defines an inventory of SGML entities, such as &aacute; for a with an acute accent (á) or &copy; for a copyright symbol (©); these are translated internally into pointers to transASCII positions in fonts. Because HTML cannot mix character sets in a document, user-installed KOI8 fonts necessarily replace the standard fonts, which means that Cyrillic glyphs are displayed in place of the characters represented by the entity names. This is a more complex problem than the preceding ones; there is no inherent reason why character (pick an example) should represent Latin (blah) or Cyrillic (blah), but there is a very good reason why &copy; should represent a copyright mark and not an alphabetic letter, Latin or Cyrillic.
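The entity problem in the last point can be made concrete with a small Python sketch. It assumes the KOI8-R variant (Python's koi8_r codec) and uses the entity table in Python's standard html.entities module; the helper name glyph_under_koi8_font is hypothetical.

```python
# Sketch: entity references resolve to fixed ISOLatin1 code positions, so a
# KOI8 font installed in place of an ISOLatin1 font changes what they show.
# Assumption: the KOI8-R variant (Python's "koi8_r" codec).
from html.entities import name2codepoint


def glyph_under_koi8_font(entity_name):
    """Return the character a KOI8-R font would draw for an HTML entity."""
    position = name2codepoint[entity_name]    # ISOLatin1 cell, e.g. 169 for copy
    return bytes([position]).decode("koi8_r")


for name in ("aacute", "copy"):
    intended = chr(name2codepoint[name])
    shown = glyph_under_koi8_font(name)
    print(f"&{name}; intended {intended!r}, KOI8 font shows {shown!r}")
```

In both cases the entity name still points at the same numeric position, but the glyph stored at that position is no longer the character the entity name promises.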

There is, on the other hand, an extremely powerful argument in favor of the KOI8 font type of approach: with all its limitations, it is the only working solution in wide use. That it is not HTML-conformant might not be a significant consideration for users who need to see Cyrillic in their browsers, a goal that cannot currently be achieved in any HTML-conformant way because HTML simply does not support Cyrillic.

Users who lack KOI8 system support, or who are unwilling to accept the loss of ISOLatin1 support as the price for KOI8, have several alternatives, none of them satisfactory. First, they can use an ASCII-based encoding system to represent Cyrillic, such as George Fowler's proprietary encoding (based on a one-to-one correspondence between Cyrillic and ASCII characters) or David J. Birnbaum's modification of the Library of Congress Russian transliteration system (which includes fewer ambiguities than the original Library of Congress system, but which nonetheless is not suitable for automated bidirectional transliteration).[7] These systems use only ASCII characters, and therefore will work as advertised in any unmodified web browser.[8] They impose certain costs, however: they do not distinguish ASCII (Latin characters) from ASCIIfied Cyrillic, they are not official standards, and both involve unfamiliar and possibly unintuitive mappings that the user must take the trouble to learn.
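The obstacle to automated bidirectional transliteration can be shown in a few lines of Python. The table below is a tiny hypothetical fragment in the spirit of Library of Congress romanization; it is not the actual modified LC table or the Fowler encoding.

```python
# Sketch of why a multi-letter ASCII transliteration may not be reversible.
# The table is a hypothetical fragment, not the real modified LC mapping.
TO_ASCII = {
    "ш": "sh",
    "ч": "ch",
    "щ": "shch",
    "а": "a",
}


def to_ascii(cyrillic):
    """Transliterate letter by letter, passing unknown characters through."""
    return "".join(TO_ASCII.get(ch, ch) for ch in cyrillic)


one_letter = to_ascii("ща")     # the single letter shcha, then a
two_letters = to_ascii("шча")   # sha, then che, then a

# Both spellings collapse to the same ASCII string, so a reverse mapping
# cannot tell them apart without additional information.
print(one_letter, two_letters, one_letter == two_letters)
```

A one-to-one encoding avoids this collision, at the cost of assigning Cyrillic letters to ASCII characters that readers may find less intuitive.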

Alternatively, web servers could be configured to deliver Cyrillic materials as PostScript documents with embedded ("downloadable") fonts. Such documents would have the advantages of following an established standard, not interfering with the rendering of non-Cyrillic materials in the browser, and being legible to all users with PostScript support (readily available on a variety of platforms through the free and freely-distributable GhostScript and GhostView). The substantial disadvantage is that PostScript documents cannot contain HTML links to other URLs, which means that users who receive PostScript documents cannot use them to access other web resources through <a href="url"> markup. (ADD PDF DISCUSSION.) PostScript documents are also likely to be significantly larger than regular HTML, since they will need to include embedded ("downloadable") fonts, and the PFA encoding of Adobe's Minion Cyrillic Latin/Cyrillic PostScript font (for example) runs to approximately 100k. And although PostScript support is readily available, it nonetheless does require an external viewer, which means that although it is HTTP-conformant, it is not HTML-conformant.

Web servers can also be configured to deliver Cyrillic documents as image files. Static documents could be stored as GIF or JPEG images; dynamic Cyrillic data could be piped through GhostScript on the server platform, generating an image file that could then be converted to GIF or JPEG, inserted into an HTML document, and delivered to the browser. This approach improves on the preceding by not requiring that browser platforms support PostScript directly, but it is of little use in text-only browsers such as Lynx, or in graphic browsers with image support disabled by the user in response to a slow modem connection.

One final complication, especially with KOI8 and other true Cyrillic configurations, is that with some systems it may be easier to install Cyrillic font support than Cyrillic keyboard support. Users with KOI8 font support will be able to read KOI8 Cyrillic pages, but they will be unable to interact with KOI8 HTML forms unless they also have KOI8 keyboard support.[9] A flexible system might treat rendering support (fonts) separately from input support (keyboards).

An Ideal Solution

HTML supports a fixed inventory of characters, and attempts to circumvent this inventory by changing fonts inevitably lead to unacceptable consequences. Extending the HTML standard to support font changes within elements is not an acceptable solution because the problem is one of character sets (inventories of characters, or informational units) rather than of fonts (inventories of glyphs, or presentational units). That is, one usually changes font as a way of changing typeface or weight, and it would be peculiar, to say the least, to treat the difference between Latin and Cyrillic as equivalent to the difference between Times and Helvetica.[10] Within the current eight-bit environment, a <charset> tag might be a suitable approach, and one reason this may not have been proposed is the widespread lack of understanding of the architectural difference between character sets and fonts.

Unicode, which is a subset of ISO/IEC 10646-1: 1993 (CHECK YEAR), is a sixteen-bit global multilingual character set. Unicode views Cyrillic and Latin characters as different on the character level, which means that it does not use the same bit combination to represent different characters in a single text stream. This approach avoids the problem of overloading the semantics of individual bit combinations; a specific combination will always point to a specific coded character (Latin, Cyrillic, or other). This system also avoids the problem of compromising the integrity of entity references (such as &copy;), since an entity reference will always point unambiguously to a single character with constant character-set semantics.[11]
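The character-level distinction can be observed directly in any Unicode-aware environment. The following Python sketch uses the standard unicodedata module; the choice of sample characters is ours.

```python
# Sketch: look-alike Latin and Cyrillic letters are distinct coded characters.
import unicodedata

latin_a = "A"            # U+0041
cyrillic_a = "\u0410"    # U+0410, visually similar to Latin A in most fonts

for ch in (latin_a, cyrillic_a):
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# The two characters never share a code, so text in both scripts can coexist
# in one stream and an indexer can tell them apart.
assert latin_a != cyrillic_a

# Latin A, Cyrillic A, and the copyright sign in a single text stream.
mixed = latin_a + cyrillic_a + "\u00a9"
print([f"U+{ord(ch):04X}" for ch in mixed])
```

Because each code identifies exactly one character, no font substitution is needed to decide whether a given position is Latin, Cyrillic, or a symbol.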

The Best Current Solution

Unicode is a relatively new technology, and it will be several years before the Internet community can rely on a World Wide Web built on a sixteen-bit global multilingual character set. Until that time, users require a solution to the non-ISOLatin1 character problem that avoids the unacceptable consequences of currently existing solutions.

Because HTML does not support non-ISOLatin1 alphabetic characters, no truly satisfactory standard solution is currently available. It is nonetheless the case that individual users may prefer one or another of the available solutions, and may even be totally unable to use certain options. If we assume that the purpose of web publication is maximizing access to information, the best solution in the current non-standard environment is the one that serves the needs of as many users as possible, even when these users support Cyrillic in different ways. Web publishers can reach this maximally broad audience not by selecting the single most popular system for delivering Cyrillic documents, whatever that may prove to be, or their personal favorite, but by supporting several such systems simultaneously, allowing individual users to select the compromises that they find least unacceptable.

With these issues in mind, the authors have constructed a CGI system that allows users to select independently from a menu of encoding vectors for user input and system output. An HTML table presents the user with a list of input encodings (for user input in forms) and output encodings (for documents returned to the user in response to queries), along with documentation (see figure 1).


Figure 1: Sample Encoding Table

     Encoding        Query            Report
     Modified LC     Query            Report
     Fowler          Query            Report
     ISOcyr1.ent     Query            Report
     KOI-8           Query            Report
     Alternative     Query            Report
     CP 1251         Query            Report
     PostScript      not available    Report
     GIF Image       not available    Report

We omit PostScript and GIF image input because these options make sense only as rendering formats. We separate input and output configuration options because users who have access to KOI8 fonts but not to KOI8 keyboard resources (for example) may wish to select one of the ASCII input encoding vectors (modified LC, Fowler, or ISOcyr1.ent) but KOI8 reporting, and users who require PostScript or image output will necessarily have to select a different query encoding vector.

Our query forms use the CGI POST method to transmit all user input, including the selected encoding vectors, as STDIN. User input is parsed and the input encoding is used to select from a set of mapping tables that mediate between the user's input encoding and the internal storage encoding (SGML character entities in the present case, see figure 2).
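The form-handling step might look roughly like the following Python sketch. It is not the authors' implementation: the field names (input_encoding, query), the table file names, and the helper read_request are all hypothetical, and a fixed sample string stands in for the POST body that a CGI program would read from standard input.

```python
# Sketch of parsing a POST body and choosing an input mapping table.
# All field names and table file names below are hypothetical.
from urllib.parse import parse_qs

INPUT_TABLES = {
    "modified_lc": "lc-to-entities.tbl",
    "fowler": "fowler-to-entities.tbl",
    "isocyr1": None,                  # already in the storage encoding
    "koi8": "koi8-to-entities.tbl",
}


def read_request(body):
    """Parse a urlencoded POST body and select the input mapping table."""
    # latin-1 maps each eight-bit byte to one character, so KOI8 input
    # survives parsing unchanged and can be remapped afterwards.
    fields = parse_qs(body, keep_blank_values=True, encoding="latin-1")
    encoding = fields.get("input_encoding", ["isocyr1"])[0]
    if encoding not in INPUT_TABLES:  # accept only encodings on the menu
        raise ValueError(f"unsupported input encoding: {encoding}")
    query = fields.get("query", [""])[0]
    return encoding, INPUT_TABLES[encoding], query


# Under CGI the body would be read from standard input; a fixed sample is
# used here so that the sketch can be run directly.
sample_body = "input_encoding=fowler&output_encoding=koi8&query=slovo"
print(read_request(sample_body))
```

Restricting the encoding name to a fixed menu means that user input never determines a file name directly.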


Figure 2: ISOcyr1.ent Cyrillic Encoding (excerpt)

     Letter    Upper Case    Lower Case
     a         &Acy;         &acy;
     b         &Bcy;         &bcy;
     v         &Vcy;         &vcy;

The user's text input is then piped through translit, a free and freely distributable program that remaps text according to a user-configurable table (see figure 3).[12]


Figure 3: Translit Mapping Table (excerpt)

     0         "A"              0       "&Acy;"    A
     0         "B"              0       "&Bcy;"    Be
     0         "V"              0       "&Vcy;"    Ve

translit understands regular expressions and plain strings, and is capable of implementing any coherent remapping quickly and effectively, without significant degradation in response time, at least for brief texts. The remapped user input is then processed in the encoding native to the system (the standard ISOcyr1.ent registered entity set, in the present case; see figure 4), whereupon reports are rendered by piping the data through translit after selecting the appropriate output encoding mapping file.[13]
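The two remapping passes can be sketched in Python as follows. This is a stand-in for illustration, not translit itself: the tables are tiny hypothetical excerpts, the function remap is our own, and KOI8 output is assumed to mean the KOI8-R variant (Python's koi8_r codec).

```python
# Sketch of table-driven remapping: input encoding -> entities -> output.
# A Python stand-in for illustration; it does not reproduce translit.
import re

# Input side: one ASCII convention -> ISOcyr1 entity names (storage form).
ASCII_TO_ENTITY = {"A": "&Acy;", "B": "&Bcy;", "V": "&Vcy;",
                   "a": "&acy;", "b": "&bcy;", "v": "&vcy;"}

# Output side: entity names -> characters, encoded later for the client.
ENTITY_TO_CHAR = {"&Acy;": "А", "&Bcy;": "Б", "&Vcy;": "В",
                  "&acy;": "а", "&bcy;": "б", "&vcy;": "в"}


def remap(text, table):
    """Replace every key of table found in text, trying longest keys first."""
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    # A single pass: replacement text is never rescanned for further matches.
    return pattern.sub(lambda match: table[match.group(0)], text)


stored = remap("baba", ASCII_TO_ENTITY)    # user input -> storage encoding
report = remap(stored, ENTITY_TO_CHAR)     # storage encoding -> characters
koi8_report = report.encode("koi8_r")      # bytes for a KOI8 client

print(stored)
print(report)
print(koi8_report)
```

Adding another input or output convention then amounts to supplying another table, while the stored documents remain untouched.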


Figure 4: SGML Source Document (excerpt)

<verb aspect="nsv-sv" conjugation="2" presentstress="apresent"  TRANSITIVITY="transitive" MOBILE="nomobile" PASTSTRESS="apast" STEMCONSONANT="nostem" PPP="noppp" IO="noio">
  <stressed>&acy;&bcy;&lcy;&acy;&kcy;&tcy;<stress>&icy;</stress>&rcy;&ocy;&vcy;&acy;&tcy;&softcy;</stressed>
  <unstressed>&acy;&bcy;&lcy;&acy;&kcy;&tcy;&icy;&rcy;&ocy;&vcy;&acy;&tcy;&softcy;</unstressed>
</verb>

It would not be practical to maintain a large inventory of documents in multiple encodings, not only because of storage requirements, but also because of the difficulty of synchronizing the variants during updates and maintenance. Our dynamic generation of output according to a user-selected encoding method avoids the problem of multiple storage formats, a problem that would be even more acute with dynamic than static documents.

Conclusion

No HTML-conformant system is able to overcome the unfortunate fact that HTML does not support Cyrillic in any natural way. But our system surpasses any other we have encountered by giving the user a choice among de facto standards, providing, for example, KOI8 and other popular Cyrillic renderings for those who require a Cyrillic display, but also three different ASCIIfied renderings plus PostScript and image files for those who lack Cyrillic browser support. Support for additional encoding formats requires only the creation of appropriate mapping tables.

Until Cyrillic support on the World Wide Web becomes as standardized as ASCII support, and until Cyrillic character set and font resources can be implemented without breaking other parts of an HTML system, we believe that our interface offers the best available method for maximizing the accessibility and utility of Cyrillic web resources, and that all of the alternatives we have encountered support a significantly smaller user base. An additional advantage of our system is that it is easily extended to other writing systems; for example, it could provide users who require Greek support with a choice between ISO 8859-7 (ASCII/Modern Greek) or the Greek "beta code" system favored by classicists.[14]


Notes

[1] Currently undergoing revision by the ISO. Any revisions will not affect the character inventory or arrangement.

[2] In this report, the term server refers both to the HTTP server application and to the documents it delivers to the browser. That is, we identify as server-based problems both server software that mishandles correct HTML markup and documents that are created with incorrect HTML markup and then delivered by the server software as written.

[3] A related problem is the semantic misuse of markup by some HTML authors to achieve specific rendering effects, such as using <DD> tags to control indentation in contexts that are not structurally related to definition lists. <DD> is a legitimate tag that all HTML-conformant browsers must be able to parse, but it is semantically legitimate only when it is used to identify certain components of definition lists. The abuse of <DD> as a presentation tag representing indentation is inappropriate because the HTML standard does not dictate the rendering format of <DD>, and, in particular, it does not require that this tag be rendered by indentation. Rather, the standard requires only that this tag be implemented in a way that is consistent with how humans expect definition lists to look.

The misuse of <DD> to control indentation, rather than to indicate structure, leads to the proliferation of warnings like "This document looks best when viewed with Netscape," an appalling subversion of standard-based client/server architecture. If the goal of web publication is the dissemination of information, it is counterproductive to mark up documents in a way that is not understood by all conformant browsers. If the markup in question contributes information, it should be implemented in a conformant manner so that all readers will be able to access it. If it does not contribute information, it should be omitted.

[4] The HTML standard also supports a small number of additional non-alphabetic characters, such as the copyright sign ©.

[5] source and information

[6] Of the modern Slavic languages, Russian, Ukrainian, Belarusian, Bulgarian, Macedonian, and Serbian (formerly considered part of Serbocroatian) are written in Cyrillic. KOI8 includes all of the alphabetic characters used in these writing systems except Ukrainian "hard g". CHECK THIS

[7] See http://clover.slavic.pitt.edu/~djb/cz/modified_lc_encoding.html. Specifically, shch is ambiguously either one shch character or sh followed by ch (as in vesnushchatyj "freckled"), and upper- and lower-case hard and soft sign are not distinguished (" and ', respectively). If jo is used to represent e with diæresis, it becomes indistinguishable from j followed by o.

[8] One restriction is that the Fowler encoding system uses < to represent stress, which violates the HTML expectation that < will represent only the onset delimiter of a tag. Since most Russian text does not mark stress, this is not usually a significant practical limitation.

[9] Or the patience to enter text through the numeric keypad.

[10] Implementing Cyrillic support as a font change might seem to solve the rendering problem, but HTML documents are used for more than rendering. An indexing robot would want to treat text identically regardless of whether it was rendered in Times or Helvetica, but it would ideally want to index look-alike Latin and Cyrillic text differently.

[11] An entity reference already refers unambiguously to a single character as long as one remains within the HTML standard, but, as was noted above, the non-standard use of non-ISOLatin1 fonts disrupts this mapping.

[12] translit is available at ftp://www.ccl.net/pub/central_eastern_europe/russian/translit.

[13] Usual CGI security is warranted to avoid putting unknown and potentially harmful text on a system command line.
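One common way to follow this advice is to pass user text to the external filter on standard input rather than as part of a command line. The Python sketch below illustrates the idea under stated assumptions: because we do not reproduce translit's command-line options here, a small Python one-liner serves as a stand-in filter, and the helper name remap_via_filter is hypothetical.

```python
# Sketch: hand untrusted text to an external filter on stdin, never on the
# command line. The stand-in filter below is NOT translit; it merely
# upper-cases its input so that the sketch can be run anywhere.
import subprocess
import sys

STAND_IN_FILTER = [
    sys.executable, "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]


def remap_via_filter(command, text):
    """Run an external filter, supplying the user's text on stdin only."""
    result = subprocess.run(
        command,                                  # argument list, no shell
        input=text.encode("ascii", "replace"),    # user text travels on stdin
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("ascii", "replace")


# Shell metacharacters in the input are treated as ordinary data.
print(remap_via_filter(STAND_IN_FILTER, "slovo; echo unsafe"))
```

Because the command is a fixed argument list and no shell is involved, nothing the user types can alter which program runs or with which options.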

[14] Our system does not address sixteen-bit character sets or right-to-left writing systems.