The file lynx.c is a self-contained ANSI C code that performs a very primitive formatting of HTML documents using an "advanced" "inline-compiled" state automaton. The code was the answer to a problem offered at a 1995 Dallas area high school programming contest, sponsored by an UNT chapter of the ACM. The text of the problem assignment is at the beginning of code's title comments. Please forgive elementary tutorial-like explanations: it was intended, after all, for high-school students. The html formatter handles '<' '>' '&' special symbol representations,

,


,
, 
, formatting tags, and fills and proper indents the text wherever necessary. All other formatting tags are ignored (considered as invisible comments). The formatter is pretty close to that used in a LYNX web browser for character-based terminals. The most notable notable feature of the code is an extensive use of "advanced" finite-state machine with byte-compiled inline actions. I call the machine "advanced" because it is actually a push-down automaton in disguise. HTML is a context-free language (language of properly balanced parentheses), so one has to use some kind of stack if he wants to get by with the fewer than infinite number of states. So, lynx.c has a stack, but of a peculiar kind: a stack of automaton states. Thus, in addition to a regular state transition (or jump), it's possible for a state to "call" another state as a "subroutine", and get the control back after the subroutine finishes (performs "return from a subautomaton"). The subroutine can also do "longjmp" with the corresponding stack adjustment. The actions of the automaton are "byte-compiled" by a C compiler itself. For example, the following piece of the code Arc N_filling [] = { {' ',{CALL,output_space,EOA},&N_space}, {'\t',{CALL,output_space,EOA},&N_space}, {'\n',{CALL,output_space,EOA},&N_space}, {'&',{NPUSH,&N_fill_quote,EOA},&N_amp}, /* Handle & in a subroutine */ {'<',{NPUSH,&N_filling,EOA},&N_tag}, /* Handle tag in a subroutine*/ {ANY,{WRITEC,EOA},&N_filling} }; says that when the automaton is in state 'N_filling' (that is, performs filling of a current paragraph) and sees a space as a current character, it should "CALL" a C function output_space() and jump to the state N_space; if the current character is '<' the automaton calls the state N_tag as a "subroutine". The automaton writes any other current character and remains in the N_filling state. CALL, NPUSH, WRITEC etc are all byte-codes, which have been defined, say, like #define CALL (void *)7 /* Call the function specified by*/ /* the next bytecode */ They are interpreted by a special action-interpreter. BTW, the interpreter code sports one of my best one-liners: void ** actionp; ....... case CALL: (*((void (*)())*++actionp))(); The last line is _really_ part of the code, and it really works, despite its freaking bizarre look. One reason I like C it doesn't limit obfuscation at all... The code correctly compiles (and runs afterwards) on UNIX with gcc 2.6.3 and PC/MSDOS using Turbo C 3.1. Please ignore the flood of "initialization from incompatible pointer type" warnings: there _are_ really very many of them, but they're all _harmless_ (a consequence of a wild casting, say, char to void *, or a function pointer to void * and back ). To try the code out, take any HTML file (for example, your home page), rename is as html.in, and set the lynx loose. Having a fixed file name, html.in (which has to be in the same directory as the lynx executable) is indeed silly, but that's was easier for judging. It shouldn't be that hard to modify the code to get it to take the name of an input file from the command line. I will appreciate all comments/questions