From oleg Tue May 9 09:25:21 CDT 1995 Newsgroups: comp.lang.c,comp.lang.c.moderated,comp.infosystems.www.misc Subject: Advanced Finite Automaton as a simple HTML formatter [Crazy Example] Summary: lynx-like HTML formatting by a byte-compiled FSM with "subautomata" Followup-To: comp.lang.c Distribution: Organization: University of North Texas, Denton Keywords: finite state machine, push-down automaton, byte-compilation Cc: Status: RO Finite State Machines appear to hold some fascinating spell: discussion as to how to paint this beast in c/c++ never seems to end. Recently I have stumbled on some surrealistic way of doing it, and thought it might be of some interest. My exhibit is a self-contained ANSI C code that performs a primitive lynx-like formatting of HTML documents using an "advanced" "inline-compiled" state automaton. A peculiar feature of the automaton is an option to "call" a new state instead of just moving to it, with an opportunity to "return" to the caller-state. The html formatter handles '<' '>' '&' special symbol representations,

,


,
, 
, formatting tags, and fills and proper indents the text wherever necessary. The formatter is kind of close to that used in a LYNX web browser for character-based terminals. Try to set it loose on your home page. As was mentioned above, the most notable notable feature is an extensive use of "advanced" finite-state machine with byte-compiled inline actions. I call the machine "advanced" because it is actually a push-down automaton in disguise. HTML is a context-free language (language of properly balanced parentheses), so one has to use some kind of stack if he wants to get by with the fewer than infinite number of states. So, lynx.c has a stack, but of a peculiar kind: a stack of automaton states. Thus, in addition to a regular state transition (jump), it's possible for a state to "call" another state as a "subroutine", and get the control back after the subroutine finishes (performs "return from a subautomaton"). The subroutine can also do "longjmp" with the corresponding stack adjustment. The actions of the automaton are "byte-compiled" by the C compiler itself. For example, the following piece of code Arc N_filling [] = { {' ',{CALL,output_space,EOA},&N_space}, {'\t',{CALL,output_space,EOA},&N_space}, {'\n',{CALL,output_space,EOA},&N_space}, {'&',{NPUSH,&N_fill_quote,EOA},&N_amp}, /* Handle & in a subroutine */ {'<',{NPUSH,&N_filling,EOA},&N_tag}, /* Handle tag in a subroutine*/ {ANY,{WRITEC,EOA},&N_filling} }; says that when the automaton is in state 'N_filling' (that is, performs filling of a current paragraph) and sees a space as a current character, it should "CALL" a C function output_space() and jump to a state N_space; if the current character is '<' the automaton calls state N_tag as a "subroutine". The automaton outputs any other current character and remains in the N_filling state. CALL, NPUSH, WRITEC etc are all byte-codes, which have been defined, say, like #define CALL (void *)7 /* Call the function specified by*/ /* the next bytecode */ They are interpreted by a special action-interpreter. BTW, the interpreter code sports one of my best one-liners: void ** actionp; ....... case CALL: (*((void (*)())*++actionp))(); The last line is _really_ part of the code, and it really works, despite its freaking bizarre look. One reason I like C there is no limit to obfuscation whatsoever... The code correctly compiles (and runs afterwards) on UNIX with gcc 2.6.3 and PC/MSDOS using Turbo C 3.1. BTW, The code was the answer to a problem offered at a 1995 Dallas area high school programming contest, sponsored by UNT chapter of the ACM. The text of the problem assignment is at the beginning of the code title comments. Please forgive elementary tutorial-like explanations in the assignment: it was intended, after all, for high-school students. The code has been posted to comp.sources.misc, and is also available from [see the web page] I'll appreciate any comment/question etc and will do my best to answer/clear them; please mail them to me at [see the web page]