After character conversion, the *.SHW files that were identified for preservation were converted to SGML with custom SNOBOL4 scripts. The SNOBOL4 implementation was Phil Budne's SNOBOL-in-C (C-MAINBOL version 0.99.3).
Modification: The SIL source for SNOBOL sets an input record length limit of 132 (originally 80!), which caused problems with some of the scripts used in this project. The limit was raised by changing 132 to 1024 in the CARDSZ define in equ.h and recompiling. The modified line reads:
# define CARDSZ (1024)
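To judge whether a given input file would run into the record length limit, its lines can be checked in advance. The following is a minimal Python sketch, not part of the project's tooling; the function name overlong_lines is hypothetical, and it assumes the limit can be approximated by counting characters per line (the SNOBOL runtime may count bytes instead, which matters for multi-byte encodings).

```python
def overlong_lines(lines, limit=132):
    """Return (line_number, length) for every line longer than `limit`.

    Assumption: length is measured in characters, excluding the newline.
    """
    result = []
    for number, line in enumerate(lines, start=1):
        length = len(line.rstrip("\n"))
        if length > limit:
            result.append((number, length))
    return result


# Illustrative data only: one short line, one of 133 characters, one of 132.
sample = ["short line\n", "x" * 133 + "\n", "y" * 132 + "\n"]
print(overlong_lines(sample))  # [(2, 133)]
```

Passing limit=1024 would check the same input against the raised limit.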
The only frequency file identified for conversion was FREQSHAW.SHW, a space-delimited report with three columns: a wordform and two frequency counts that sometimes differ (exact meaning to be determined). Sample input looks like:
A 815 808
AVGUSTOV 1 1
AVGUSTOM 1 1
The output of character conversion with the shaw-sgml.rus translit filter looks like:
а 815 808
августов 1 1
августом 1 1
The DTD designed for this report was:
<!doctype pfreq [
<!element pfreq - - (entry)+>
<!element entry - - (lexeme,frequency,remainder)>
<!element (lexeme | frequency | remainder) - - (#PCDATA)>
<!entity % ISOcyr1 SYSTEM 'ISOcyr1.ent'>
%ISOcyr1;
]>
<frequency> represents the middle column. Two-column frequency files include the second column but not the third, which suggests that the second column is the one that represents frequency. <remainder> contains the third field. These labels will be changed once the meaning of the fields has been determined conclusively.
Sample SGML output from frequency.sno looks like:
<entry><lexeme>а</lexeme><frequency>815</frequency><remainder>808</remainder></entry>
<entry><lexeme>августов</lexeme><frequency>1</frequency><remainder>1</remainder></entry>
<entry><lexeme>августом</lexeme><frequency>1</frequency><remainder>1</remainder></entry>
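The mapping from a three-column record to an <entry> element can be sketched as follows. This is an illustrative Python sketch, not the project's frequency.sno (which is a SNOBOL4 script and is not reproduced here); the function names line_to_entry and convert are hypothetical. It assumes one record per line with whitespace-separated fields, and it passes the wordform through unchanged, so any entity references produced by the translit filter are left intact.

```python
def line_to_entry(line):
    """Convert one 'wordform count count' record to an <entry> element."""
    fields = line.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    lexeme, frequency, remainder = fields
    return (
        f"<entry><lexeme>{lexeme}</lexeme>"
        f"<frequency>{frequency}</frequency>"
        f"<remainder>{remainder}</remainder></entry>"
    )


def convert(lines):
    """Convert all non-blank records, preserving input order."""
    return [line_to_entry(line) for line in lines if line.strip()]


# The sample records shown above, after character conversion.
sample = ["а 815 808", "августов 1 1", "августом 1 1"]
for entry in convert(sample):
    print(entry)
```

Raising an error on any record that does not have exactly three fields is a design choice for this sketch; a two-column frequency file would need a variant that omits <remainder>, which the DTD above does not currently allow.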