; Select a uniformly distributed random node from a tree in one pass,
; without an a priori knowledge on the number of nodes in the tree.
;
; The algorithm is an instance of a Reservoir sampling:
; Select at random N records from a sequence -- without a
; priori knowledge of the total number of records in a sequence
; (provided that this number is greater than N). The method guarantees
; that each record is selected with a probability of N/M, where M is the
; total number of the records in the sequence.
;
; See
; Reservoir Sampling
; by Paul F. Hultquist and William R. Mahoney
; Dr. Dobbs J., January 2001, p. 189.
; The "Algorithm Alley" column.
; The algorithm was originally developed by Alan Waterman.
;
; Article posting headers:
; Date: Tue, 15 Apr 2003 22:17:15 -0700
; Newsgroups: comp.lang.scheme
; Subject: Re: random node in tree
; References:
; Message-ID: <7eb8ac3e.0304152117.3124c038@posting.google.com>
;
; $Id: random-tree-node.scm,v 1.1 2003/04/17 02:13:43 oleg Exp oleg $
; procedure random-node TREE -> [COUNT NODE]
; Traverse the TREE _once_ and return COUNT, the total number of nodes
; in the tree, and NODE -- one node of the tree selected with
; the probability of 1/COUNT. TREE must be a pair.
;
; We consider '() to be the absence of a child. Thus
; '(a . ()) is a tree with one, left child.
; '(() . a) is a tree with one right child.
; We do not count leaves as nodes. Only internal nodes (aka, pairs)
; count. If we wish to count leaves too (which are atoms other than
; '()), replace "(if (not (pair? node)) ...)" test below with
; "(if (null? node) ...)".
; We assume that a procedure (random i j) returns a random integer k,
; i <= k <= j, that is uniformly distributed within the interval [i,j]
; (endpoints included!)
(define (random-node tree)
(let select ((node (car tree)) (p 1) (reservoir tree)
(todo (list (cdr tree))))
(if (not (pair? node)) ; Leaves don't count
(if (null? todo) (values p reservoir)
(select (car todo) p reservoir (cdr todo)))
(let*
((p (+ 1 p))
(k (random 1 p))
(reservoir (if (= k 1) node reservoir)))
(if (pair? node)
(select (car node) p reservoir (cons (cdr node) todo))
(if (null? todo) (values p reservoir)
(select (car todo) p reservoir (cdr todo))))))))
; Proof of the algorithm.
;
; Claim:
; At each invocation of
; (select node p reservoir todo)
; p>=1 is the number of previously traversed nodes (up to but not including
; 'node')
; 'node' is either '() or the the (p+1)-th node of the tree
; reservoir is the node randomly selected from the traversed
; with the probability 1/p
; todo is the stack of right brothers of 'node'
; Proof by induction:
;
; Base case: initial invocation.
; 'node' is the left child of the root or '(), todo a singleton list
; that contains its right brother, p = 1 and reservoir is the traversed
; node (which is the root).
; Claim holds.
;
; Induction hypothesis: Claim holds after q nodes are traversed,
; as we enter (select node q reservoir todo)
; If 'node' does not count (it's '() or a leaf, if we don't count leaves)
; we skip it and continue the pre-order traversal.
; If 'node' is the node that counts, we set q' to q+1, k to be a
; number uniformly distributed 1 <= k <= q'. The number k has the probability
; 1/q' = 1/(q+1) of being 1.
; We set reservoir to be 'node' with the probability 1/(q+1),
; we maintain the current value of the reservoir with the probability
; 1 - 1/(q+1) = q/(q+1). Thus reservoir is one of the q previously
; traversed nodes selected with the probability 1/q * q/(q+1) = 1/(q+1).
; If node has children, we recursively enter select, with
; the first argument being the left child of 'node' or nil,
; the second argument (q+1) -- the number of nodes traversed,--
; reservoir is the node selected uniformly at random from the traversed,
; todo is the (grown) stack of the right brothers of the first argument.
; The claim holds.
; If 'node' is not a pair but 'todo' is not empty, we re-enter
; select but the invariant holds. If we don't count leaves as nodes,
; the latter alternative does not apply.
;
; It follows from the claim that when 'select' exits,
; it returns the total number of nodes in the tree and one node
; uniformly randomly selected from the them.
; Tests
; In the following we define (random i j) to always return its left
; boundary.
(define (random i j) i)
; With such a definition of "random", (random-node tree) will return
; the last traversed node.
(define (test-random-node tree)
(display "Tree: ") (display tree) (newline)
(call-with-values
(lambda () (random-node tree))
(lambda (count rand-node)
(display "total count: ") (display count)
(display " selected node: ") (display rand-node)
(newline))))
(test-random-node '(a))
(test-random-node '(() . (b c)))
(test-random-node '(() . (b . c)))
(test-random-node '(* (+ 3 4) 5))
; Results of the test runs
; Tree: (a)
; total count: 1 selected node: (a)
; Tree: (() b c)
; total count: 3 selected node: (c)
; Tree: (() b . c)
; total count: 2 selected node: (b . c)
; Tree: (* (+ 3 4) 5)
; total count: 6 selected node: (5)