From posting-system@google.com Thu Aug 16 18:27:14 2001 Date: Thu, 16 Aug 2001 18:24:24 -0700 Reply-To: oleg@pobox.com From: oleg@pobox.com (oleg@pobox.com) X-comment: added a reference to Andrew Appel's paper on Single Static Assignment. His paper stated a thesis that a dependency analysis of imperative array code is related to extracting a functional program from an imperative one. The present article provides several illustrations. Newsgroups: comp.lang.functional Subject: FP vs assembly programming for super-scalar CPUs [Was: Nested Loops in Clean] Message-ID: <7eb8ac3e.0108161724.35a3117f@posting.google.com> Status: OR A transformation of typical imperative code into FP seems somewhat similar to the data and control dependency analyses in compilers for modern CPUs. By typical imperative code we mean a "typical numerical" code that involves nested loops and updateable arrays. We will consider three examples of such transformations. The first example is the one already discussed in this thread. The example is slightly modified to get the internal loop to do something useful. In the imperative form, the example reads for(int i=1; i<=zx; i++) for(int j=1, a=i+1, b=i*i; j<=zx; j+=1, a+=1, b+=1) for(z=j; z<=zu; z+=2) array[z] = a, array[z+1] = b; The example is actually rather simple. We note that the 'array' is only written to but never read. There are no forks of control. The only dependency to worry about is serializing of updates to the array. The solution then becomes obvious -- we first generate an "update list" -- a list that tells which element of the array should be set to which value. After we generate the list we execute the updates. The generation of the list of updates is done functionally. We execute the updates imperatively: monadically. Due to the lazy nature of Haskell, the elements of the list of updates can be executed as they are generated. Here's the corresponding Haskell code. import ST -- Generate the list of updates to _an_ array ex1_update_list zx zu = concat $ [ [(z,a), (z+1,b)] | i <- [1..zx], (j,a,b) <- takeWhile (\(j,_,_) -> j<=zx) $ iterate (\(j,a,b) -> (j+1,a+1,b+1)) (1,i+1,i*i), z <- [j,j+2..zu] ] -- execute the list of updates, in a serial manner. -- the executor is "mapM_ updater" do_loops = do { array <- newSTArray (1,4) 0; mapM_ (\(idx,new_value) -> writeSTArray array idx new_value) $ ex1_update_list 4 3; return array } -- Try to run the above and see what we get array_to_list array = sequence [ readSTArray array i | let (lwb,upb) = boundsSTArray array, i<-[lwb..upb] ] compute = runST( do_loops >>= array_to_list ) You need to use -98 flag of Hugs to load the above code. After that, you can see the result by typing "compute" at the prompt. Note that in the functional version of the code, the nested loops look "turned inside out". What used to be inside of the loop -- an assignment to the array -- now appears on the outside, as a statement in the 'do_loops'. The second example is far simpler: given an array[lwb..upb] of integers, find the sum of the elements of the array that are greater or equal than 7. Again, we start with the imperative code: int sum = 0; for(int i=lwb; i= 7 ) sum += array[i]; The analysis of this loop shows the absence of data and control dependencies on the array. The array is only read from. All elements of the array are fetched. That means that the loop can be easily parallelized and all elements of the array can be pre-fetched. -- "unsafe", speculative prefetching of the elements of the array -- No data dependencies are assumed import IOExts(unsafePerformIO) prefetch_array array = map (unsafePerformIO . stToIO) [ readSTArray array i | let (lwb,upb) = boundsSTArray array, i<-[lwb..upb] ] -- we use the previously defined function do_loops to give us -- the array to work with ex3 = sum $ filter (>=7) $ prefetch_array $ unsafePerformIO $ stToIO do_loops As our analysis indicated, the summing loop is indeed purely functional. The function prefetch_array "speculatively" prefetches the elements of the array. In database terminology, the function does a "dirty read". Note the presence of the 'map' function, which applies a function to a list in an undeterminate sequence. Due to the absence of data dependencies, the non-determinism does not matter. Again, the loop is purely functional. As the third example we consider a silly sort problem, sorting an array [0..(n-1)] in the ascending order. for(i=1; i array[j+1] ) { int temp = array[j]; array[j] = array[j+1]; array[j+1] = temp; done_already = false; } if( done_already ) break; } The algorithm should work well on small arrays, especially if they are (almost) in the right order already. This imperative code is actually quite complex for a super-scalar CPU to execute efficiently. We see data dependencies across iterations, reading from and writing into the array, conditional branches that depend on the elements of the array and the iteration history, control dependencies between the two loops. We show two solutions in Haskell. Both use fast, imperative STArrays; both try to separate the imperative part from a functional part. The imperative part is performed "in a monad" while the functional part written in ordinary Haskell. The two solutions are closely related, and written to emphasize their similarity. First the most bizarre solution: data Predicate = P1 -- predicate register 1 | P2 -- predicate register 2 | Ptrue -- always true | Pfalse -- always false deriving Show -- All instructions refer to a predicate. An instruction is executed -- only if its predicate is true data Instruction = LDA Predicate Integer -- Load from loc into accum A | LDB Predicate Integer -- Load from loc into accum B | SET_gt Predicate Predicate -- set predicate if A>B | CLR Predicate Predicate -- clear the predicate | SET Predicate Predicate -- set the predicate | STA Predicate Integer -- Store accum A into loc | STB Predicate Integer -- Store accum B into loc | HALT Predicate deriving Show ex2_instr_stream n = concat [[SET Ptrue P2] ++ concat [[LDA Ptrue j, LDB Ptrue (j+1), SET_gt Ptrue P1, STA P1 (j+1), STB P1 j, CLR P1 P2] | j<-[0..(n-i-1)]] ++ [HALT P2] | i<-[1..(n-1)]] The solution exhibits a predicate instruction execution -- as found in the most modern CPUs (e.g., Itanium) as well in Konrad Zuse's Z-3. It's instructive to type "ex2_instr_stream n" at Hugs prompt and see the result. We postpone the running of the instruction stream for a moment to consider the second solution: ex2_hi_stream n = concat [ [\array done_already -> return $ Just True] ++ [\array done_already -> do { a <- readSTArray array j; b <- readSTArray array (j+1); if a > b then writeSTArray array j b >> ( writeSTArray array (j+1) a >> ( return $ Just False )) else return $ Just done_already } | j<-[0..(n-i-1)] ] ++ [\array done_already -> return $ if done_already then Nothing else Just done_already] | i<-[1..(n-1)]] This is a higher-level version of ex2_instr_stream. Both ex2_instr_stream and ex2_hi_stream produce a stream of instructions to execute. Both functions reduce the nested loop to a stream of instructions. This is the "functional" part of the solution. The "imperative" part is the execution of the instruction stream. -- A variant of fold taking into account an early termination executor init_state commands array = run init_state commands where run Nothing _ = return array run _ [] = return array run (Just state) (cmd:cmds) = (cmd array state) >>= \new_state -> run new_state cmds -- A sample array to sort sample_array = do { array <- newSTArray (0,4) 0; writeSTArray array 0 5; writeSTArray array 1 7; writeSTArray array 2 1; writeSTArray array 3 8; writeSTArray array 4 9; return array } ex2_hi = runST( sample_array >>= executor (Just True) (ex2_hi_stream 5) >>= array_to_list ) It really works: type "ex2_hi" at the Hugs prompt to see the result. We have mentioned that the high-level instruction stream generated by 'ex2_hi_stream' is related to the low-level instruction stream, of ex2_instr_stream. Indeed they are: modulo mapping of the 'alu' function: ex2_lo = runST( sample_array >>= executor (Just (0,0,True,True)) (map alu $ ex2_instr_stream 5) >>= array_to_list ) where the alu function is the arithmetic-logical unit of our virtual machine. The function is given in the appendix. The executor function is the control unit of our virtual CPU. The analysis of data and control dependencies between loop iterations, scheduling of the instructions and interlocking of their results, pre-fetching of data -- is the job of an advanced compiler or the CPU. On the other hand, we have to do the same kind of analyses to separate the functional part of a computation from its imperative part. It appears functional programming is closer to "the bare metal" as one would've thought. We can give more examples of data/control dependency analyses of numerical algorithms (Gaussian elimination, convolutional filtering) with the goal of re-writing them in a functional language. But this article is too long and boring already. P.S. Andrew W. Appel in an article "SSA is Functional Programming" (ACM SIGPLAN Notices v. 33, no. 4, pp. 17-20, April 1998, http://www.cs.princeton.edu/~appel/papers/ssafun.ps) wrote: "The SSA (Static Single Assignment) algorithm ... is automatically converting a FORTRAN or C procedure into a well-structured functional program with nested scope. Actually, I've only shown what to do with scalar variables. Arrays are handled in high-powered (parallelizing) compilers using sophisticated dependence analysis techniques [15], which is another way of extracting the functional program inside the imperative one." The last part of the quote is precisely the thrust of the present article. We have done manual dependency analyses of several code snippets that update arrays in place. The analyses let us extract pure-functional parts of imperative programs. The rest requires serialization and can be handled by an ST monad. Appendix: the ALU function -- ALU State AccA AccB P1 P2 type ALUState elem = (elem,elem,Bool,Bool) type Storage a elem = STArray a Integer elem alu:: (Ord elem) => Instruction -> Storage a elem -> ALUState elem -> ST a (Maybe (ALUState elem)) alu instr array state@(a,b,p1,p2) = case instr of LDA p loc -> provided p $ readSTArray array loc >>= \a->return $ Just (a,b,p1,p2) LDB p loc -> provided p $ readSTArray array loc >>= \b->return $ Just (a,b,p1,p2) STA p loc -> provided p $ writeSTArray array loc a >> (return $ Just state) STB p loc -> provided p $ writeSTArray array loc b >> (return $ Just state) SET p P1 -> provided p $ return $ Just (a,b,True,p2) SET p P2 -> provided p $ return $ Just (a,b,p1,True) CLR p P1 -> provided p $ return $ Just (a,b,False,p2) CLR p P2 -> provided p $ return $ Just (a,b,p1,False) SET_gt p P1 -> provided p $ return $ Just (a,b,a>b,p2) SET_gt p P2 -> provided p $ return $ Just (a,b,p1,a>b) HALT p -> provided p $ return Nothing where provided P1 = if p1 then id else \_ -> return $ Just state provided P2 = if p2 then id else \_ -> return $ Just state provided Ptrue = id provided Pfalse = \_ -> return $ Just state