From oleg Tue May 9 09:25:21 CDT 1995
Newsgroups: comp.lang.c,comp.lang.c.moderated,comp.infosystems.www.misc
Subject: Advanced Finite Automaton as a simple HTML formatter [Crazy Example]
Summary: lynx-like HTML formatting by a byte-compiled FSM with "subautomata"
Followup-To: comp.lang.c
Distribution:
Organization: University of North Texas, Denton
Keywords: finite state machine, push-down automaton, byte-compilation
Cc:
Status: RO
Finite State Machines appear to hold some fascinating spell:
discussion as to how to paint this beast in c/c++ never seems to
end. Recently I have stumbled on some surrealistic way of doing it,
and thought it might be of some interest. My exhibit is a
self-contained ANSI C code that performs a primitive lynx-like
formatting of HTML documents using an "advanced" "inline-compiled"
state automaton. A peculiar feature of the automaton is an option to
"call" a new state instead of just moving to it, with an opportunity
to "return" to the caller-state.
The html formatter handles '<' '>' '&' special symbol
representations,
,
, ,
,
formatting tags, and fills and proper indents the text wherever
necessary. The formatter is kind of close to that used in a LYNX web
browser for character-based terminals. Try to set it loose on your
home page.
As was mentioned above, the most notable notable feature is an
extensive use of "advanced" finite-state machine with byte-compiled
inline actions. I call the machine "advanced" because it is actually a
push-down automaton in disguise. HTML is a context-free language
(language of properly balanced parentheses), so one has to use some
kind of stack if he wants to get by with the fewer than infinite
number of states. So, lynx.c has a stack, but of a peculiar kind: a
stack of automaton states. Thus, in addition to a regular state
transition (jump), it's possible for a state to "call" another state
as a "subroutine", and get the control back after the subroutine
finishes (performs "return from a subautomaton"). The subroutine can
also do "longjmp" with the corresponding stack adjustment.
The actions of the automaton are "byte-compiled" by the C
compiler itself. For example, the following piece of code
Arc N_filling [] =
{
{' ',{CALL,output_space,EOA},&N_space},
{'\t',{CALL,output_space,EOA},&N_space},
{'\n',{CALL,output_space,EOA},&N_space},
{'&',{NPUSH,&N_fill_quote,EOA},&N_amp}, /* Handle & in a subroutine */
{'<',{NPUSH,&N_filling,EOA},&N_tag}, /* Handle tag in a subroutine*/
{ANY,{WRITEC,EOA},&N_filling}
};
says that when the automaton is in state 'N_filling' (that is,
performs filling of a current paragraph) and sees a space as a current
character, it should "CALL" a C function output_space() and jump to a
state N_space; if the current character is '<' the automaton calls
state N_tag as a "subroutine". The automaton outputs any other current
character and remains in the N_filling state. CALL, NPUSH, WRITEC etc
are all byte-codes, which have been defined, say, like
#define CALL (void *)7 /* Call the function specified by*/
/* the next bytecode */
They are interpreted by a special action-interpreter. BTW, the
interpreter code sports one of my best one-liners:
void ** actionp;
.......
case CALL:
(*((void (*)())*++actionp))();
The last line is _really_ part of the code, and it really works,
despite its freaking bizarre look. One reason I like C there is no
limit to obfuscation whatsoever...
The code correctly compiles (and runs afterwards) on UNIX with gcc
2.6.3 and PC/MSDOS using Turbo C 3.1.
BTW, The code was the answer to a problem offered at a 1995 Dallas
area high school programming contest, sponsored by UNT chapter of the
ACM. The text of the problem assignment is at the beginning of the
code title comments. Please forgive elementary tutorial-like
explanations in the assignment: it was intended, after all, for
high-school students.
The code has been posted to comp.sources.misc, and is also available
from [see the web page]
I'll appreciate any comment/question etc and will do my best to
answer/clear them; please mail them to me at [see the web page]