The file lynx.c is a self-contained ANSI C code that performs
a very primitive formatting of HTML documents using an "advanced"
"inline-compiled" state automaton. The code was the answer to a
problem offered at a 1995 Dallas area high school programming contest,
sponsored by an UNT chapter of the ACM. The text of the problem
assignment is at the beginning of code's title comments. Please
forgive elementary tutorial-like explanations: it was intended, after
all, for high-school students.
The html formatter handles '<' '>' '&' special
symbol representations,
,
, ,
, formatting tags, and fills and proper indents the text wherever
necessary. All other formatting tags are ignored (considered as
invisible comments). The formatter is pretty close to that used in a
LYNX web browser for character-based terminals.
The most notable notable feature of the code is an extensive
use of "advanced" finite-state machine with byte-compiled inline
actions. I call the machine "advanced" because it is actually a
push-down automaton in disguise. HTML is a context-free language
(language of properly balanced parentheses), so one has to use some
kind of stack if he wants to get by with the fewer than infinite
number of states. So, lynx.c has a stack, but of a peculiar kind: a
stack of automaton states. Thus, in addition to a regular state
transition (or jump), it's possible for a state to "call" another
state as a "subroutine", and get the control back after the subroutine
finishes (performs "return from a subautomaton"). The subroutine can
also do "longjmp" with the corresponding stack adjustment.
The actions of the automaton are "byte-compiled" by a C
compiler itself. For example, the following piece of the code
Arc N_filling [] =
{
{' ',{CALL,output_space,EOA},&N_space},
{'\t',{CALL,output_space,EOA},&N_space},
{'\n',{CALL,output_space,EOA},&N_space},
{'&',{NPUSH,&N_fill_quote,EOA},&N_amp}, /* Handle & in a subroutine */
{'<',{NPUSH,&N_filling,EOA},&N_tag}, /* Handle tag in a subroutine*/
{ANY,{WRITEC,EOA},&N_filling}
};
says that when the automaton is in state 'N_filling' (that is,
performs filling of a current paragraph) and sees a space as a current
character, it should "CALL" a C function output_space() and jump to
the state N_space; if the current character is '<' the automaton calls
the state N_tag as a "subroutine". The automaton writes any
other current character and remains in the N_filling state. CALL,
NPUSH, WRITEC etc are all byte-codes, which have been defined, say, like
#define CALL (void *)7 /* Call the function specified by*/
/* the next bytecode */
They are interpreted by a special action-interpreter. BTW, the
interpreter code sports one of my best one-liners:
void ** actionp;
.......
case CALL:
(*((void (*)())*++actionp))();
The last line is _really_ part of the code, and it really works,
despite its freaking bizarre look. One reason I like C it doesn't
limit obfuscation at all...
The code correctly compiles (and runs afterwards) on UNIX with
gcc 2.6.3 and PC/MSDOS using Turbo C 3.1. Please ignore the flood of
"initialization from incompatible pointer type" warnings: there _are_
really very many of them, but they're all _harmless_ (a consequence of
a wild casting, say, char to void *, or a function pointer to void *
and back ).
To try the code out, take any HTML file (for example, your
home page), rename is as html.in, and set the lynx loose. Having a
fixed file name, html.in (which has to be in the same directory as the
lynx executable) is indeed silly, but that's was easier for
judging. It shouldn't be that hard to modify the code to get it to
take the name of an input file from the command line.
I will appreciate all comments/questions