From posting-system@google.com Thu Aug 16 18:27:14 2001
Date: Thu, 16 Aug 2001 18:24:24 -0700
Reply-To: oleg@pobox.com
From: oleg@pobox.com (oleg@pobox.com)
X-comment: added a reference to Andrew Appel's paper on Single Static
   Assignment. His paper stated a thesis that a dependency analysis
   of imperative array code is related to extracting a functional
   program from an imperative one. The present article provides
   several illustrations.
Newsgroups: comp.lang.functional
Subject: FP vs assembly programming for super-scalar CPUs [Was: Nested Loops in Clean]
Message-ID: <7eb8ac3e.0108161724.35a3117f@posting.google.com>
Status: OR

A transformation of typical imperative code into FP seems somewhat
similar to the data and control dependency analyses in compilers for
modern CPUs. By typical imperative code we mean a "typical numerical"
code that involves nested loops and updateable arrays. We will
consider three examples of such transformations.

The first example is the one already discussed in this thread. The
example is slightly modified to get the internal loop to do something
useful. In the imperative form, the example reads

for(int i=1; i<=zx; i++)
 for(int j=1, a=i+1, b=i*i; j<=zx; j+=1, a+=1, b+=1)
    for(z=j; z<=zu; z+=2)
       array[z] = a, array[z+1] = b;

The example is actually rather simple. We note that the 'array' is
only written to but never read. There are no forks of control. The
only dependency to worry about is serializing of updates to the
array. The solution then becomes obvious -- we first generate an
"update list" -- a list that tells which element of the array should
be set to which value. After we generate the list we execute the
updates. The generation of the list of updates is done
functionally. We execute the updates imperatively: monadically.  Due
to the lazy nature of Haskell, the elements of the list of updates can
be executed as they are generated.

Here's the corresponding Haskell code.

import ST

-- Generate the list of updates to _an_ array
ex1_update_list zx zu = concat $
  [ [(z,a), (z+1,b)] |
	i <- [1..zx],
	(j,a,b) <- takeWhile (\(j,_,_) -> j<=zx) $
			iterate (\(j,a,b) -> (j+1,a+1,b+1)) (1,i+1,i*i),
	z <- [j,j+2..zu]
  ]
  
-- execute the list of updates, in a serial manner.
-- the executor is "mapM_ updater"
do_loops = do { array <- newSTArray (1,4) 0;
	       mapM_ (\(idx,new_value) -> writeSTArray array idx new_value) $
	             ex1_update_list 4 3;
	       return array }		     

-- Try to run the above and see what we get

array_to_list array = sequence 
 [ readSTArray array i | 
     let (lwb,upb) = boundsSTArray array, i<-[lwb..upb] ]
     
compute = runST( do_loops >>= array_to_list )


You need to use -98 flag of Hugs to load the above code. After that,
you can see the result by typing "compute" at the prompt. Note that in
the functional version of the code, the nested loops look "turned
inside out". What used to be inside of the loop -- an assignment to
the array -- now appears on the outside, as a statement in the
'do_loops'.


The second example is far simpler: given an array[lwb..upb] of
integers, find the sum of the elements of the array that are greater
or equal than 7. Again, we start with the imperative code:

int sum = 0;
for(int i=lwb; i<upb; i++)
   if( array[i] >= 7 )
     sum += array[i];

The analysis of this loop shows the absence of data and control
dependencies on the array. The array is only read from. All elements
of the array are fetched. That means that the loop can be easily
parallelized and all elements of the array can be pre-fetched.

-- "unsafe", speculative prefetching of the elements of the array
-- No data dependencies are assumed
import IOExts(unsafePerformIO)

prefetch_array array = map (unsafePerformIO . stToIO)
  [ readSTArray array i | 
     let (lwb,upb) = boundsSTArray array, i<-[lwb..upb] ]

-- we use the previously defined function do_loops to give us
-- the array to work with
ex3 = sum $ filter (>=7) $ prefetch_array $ unsafePerformIO $ 
            stToIO do_loops

As our analysis indicated, the summing loop is indeed purely
functional. The function prefetch_array "speculatively" prefetches the
elements of the array. In database terminology, the function does a
"dirty read". Note the presence of the 'map' function, which applies a
function to a list in an undeterminate sequence. Due to the absence of
data dependencies, the non-determinism does not matter. Again, the
loop is purely functional.

As the third example we consider a silly sort problem, sorting an
array [0..(n-1)] in the ascending order.

for(i=1; i<n; i++)
{
  bool done_already = true;
  for(j=0; j<n-i; j++)
    if( array[j] > array[j+1] )
    {
      int temp = array[j];
      array[j] = array[j+1];
      array[j+1] = temp;
      done_already = false;
    }
  if( done_already )
    break;
}

The algorithm should work well on small arrays, especially if they are
(almost) in the right order already.

This imperative code is actually quite complex for a super-scalar CPU
to execute efficiently. We see data dependencies across iterations,
reading from and writing into the array, conditional branches that
depend on the elements of the array and the iteration history, control
dependencies between the two loops.

We show two solutions in Haskell. Both use fast, imperative STArrays;
both try to separate the imperative part from a functional part. The
imperative part is performed "in a monad" while the functional part
written in ordinary Haskell. The two solutions are closely related,
and written to emphasize their similarity.

First the most bizarre solution:

data Predicate = P1 -- predicate register 1
	       | P2 -- predicate register 2
	       | Ptrue -- always true
	       | Pfalse -- always false
		deriving Show
			       
-- All instructions refer to a predicate. An instruction is executed
-- only if its predicate is true
data Instruction = LDA Predicate Integer -- Load from loc into accum A
		 | LDB Predicate Integer -- Load from loc into accum B
		 | SET_gt Predicate Predicate -- set predicate if A>B
		 | CLR Predicate Predicate -- clear the predicate
		 | SET Predicate Predicate -- set the predicate
		 | STA Predicate Integer -- Store accum A into loc
		 | STB Predicate Integer -- Store accum B into loc
		 | HALT Predicate
		deriving Show
		 

ex2_instr_stream n = concat
 [[SET Ptrue P2] ++ 
  concat [[LDA Ptrue j, LDB Ptrue (j+1), SET_gt Ptrue P1,
	   STA P1 (j+1), STB P1 j, CLR P1 P2] | j<-[0..(n-i-1)]] ++
  [HALT P2] | i<-[1..(n-1)]]


The solution exhibits a predicate instruction execution -- as found in
the most modern CPUs (e.g., Itanium) as well in Konrad Zuse's
Z-3. It's instructive to type "ex2_instr_stream n" at Hugs prompt and
see the result. We postpone the running of the instruction stream for
a moment to consider the second solution:

ex2_hi_stream n = concat
 [ [\array done_already -> return $ Just True] ++
   [\array done_already -> do {
	a <- readSTArray array j;
	b <- readSTArray array (j+1);
	if a > b then
	   writeSTArray array j b >>
	   ( writeSTArray array (j+1) a >>
	     ( return $ Just False ))
	else return $ Just done_already } | j<-[0..(n-i-1)] ] ++
   [\array done_already -> return $ 
		     if done_already then Nothing else Just done_already]
      | i<-[1..(n-1)]]

This is a higher-level version of ex2_instr_stream. Both
ex2_instr_stream and ex2_hi_stream produce a stream of instructions to
execute. Both functions reduce the nested loop to a stream of
instructions. This is the "functional" part of the solution. The
"imperative" part is the execution of the instruction stream.

-- A variant of fold taking into account an early termination

executor init_state commands array = run init_state commands
  where
       run Nothing _ = return array
       run _ []      = return array
       run (Just state) (cmd:cmds) = (cmd array state) >>=
				\new_state -> run new_state cmds


-- A sample array to sort
sample_array = do {
		array <- newSTArray (0,4) 0;
		writeSTArray array 0 5;
		writeSTArray array 1 7;
		writeSTArray array 2 1;
		writeSTArray array 3 8;
		writeSTArray array 4 9;
		return array }
		

ex2_hi = runST( sample_array >>=
		executor (Just True) (ex2_hi_stream 5)
		>>= array_to_list ) 

It really works: type "ex2_hi" at the Hugs prompt to see the result.

We have mentioned that the high-level instruction stream
generated by 'ex2_hi_stream' is related to the low-level instruction
stream, of ex2_instr_stream. Indeed they are: modulo mapping of the
'alu' function:

ex2_lo = runST( sample_array >>= 
	     executor (Just (0,0,True,True)) (map alu $ ex2_instr_stream 5)
		>>= array_to_list )

where the alu function is the arithmetic-logical unit of our virtual
machine. The function is given in the appendix. The executor function
is the control unit of our virtual CPU.


The analysis of data and control dependencies between loop iterations,
scheduling of the instructions and interlocking of their results,
pre-fetching of data -- is the job of an advanced compiler or the CPU.
On the other hand, we have to do the same kind of analyses to separate
the functional part of a computation from its imperative part.

It appears functional programming is closer to "the bare metal" as one
would've thought.

We can give more examples of data/control dependency analyses of
numerical algorithms (Gaussian elimination, convolutional filtering)
with the goal of re-writing them in a functional language. But this
article is too long and boring already.

P.S.
Andrew W. Appel in an article "SSA is Functional Programming"
(ACM SIGPLAN Notices v. 33, no. 4, pp. 17-20, April 1998,
http://www.cs.princeton.edu/~appel/papers/ssafun.ps) wrote:

"The SSA (Static Single Assignment) algorithm ... is automatically
converting a FORTRAN or C procedure into a well-structured functional
program with nested scope.  Actually, I've only shown what to do with
scalar variables. Arrays are handled in high-powered (parallelizing)
compilers using sophisticated dependence analysis techniques [15],
which is another way of extracting the functional program inside the
imperative one."

The last part of the quote is precisely the thrust of the present
article. We have done manual dependency analyses of several code
snippets that update arrays in place. The analyses let us extract
pure-functional parts of imperative programs. The rest requires
serialization and can be handled by an ST monad.


Appendix: the ALU function

--   ALU State        AccA AccB  P1   P2
type ALUState elem = (elem,elem,Bool,Bool)
type Storage a elem = STArray a Integer elem

alu:: (Ord elem) => Instruction -> Storage a elem -> ALUState elem ->
ST a (Maybe (ALUState elem))
alu instr array state@(a,b,p1,p2) = case instr of
       LDA p loc -> provided p $ 
		     readSTArray array loc >>= \a->return $ Just (a,b,p1,p2)
       LDB p loc -> provided p $ 
		     readSTArray array loc >>= \b->return $ Just (a,b,p1,p2)
       STA p loc -> provided p $ 
		     writeSTArray array loc a >> (return $ Just state)
       STB p loc -> provided p $ 
		     writeSTArray array loc b >> (return $ Just state)
       SET p P1 -> provided p $ return $ Just (a,b,True,p2)
       SET p P2 -> provided p $ return $ Just (a,b,p1,True)
       CLR p P1 -> provided p $ return $ Just (a,b,False,p2)
       CLR p P2 -> provided p $ return $ Just (a,b,p1,False)
       SET_gt p P1 -> provided p $ return $ Just (a,b,a>b,p2)
       SET_gt p P2 -> provided p $ return $ Just (a,b,p1,a>b)
       HALT p ->  provided p $ return Nothing
  where
       provided P1 = if p1 then id else \_ -> return $ Just state
       provided P2 = if p2 then id else \_ -> return $ Just state
       provided Ptrue = id
       provided Pfalse = \_ -> return $ Just state