Stream processing also exhibits the painful abstraction vs. performance trade-off. Manually written loops and state-machines offer the highest performance and the least memory overhead, but are not reusable or extensible. Libraries of freely composable stream components let programmers quickly assemble an obviously correct stream application, but suffer from the high overhead of abstractions, mainly due to the repeated creation and disposal of intermediate data structures such as closures, captured continuations, objects and collections -- let alone intermediate streams. Eliminating such intermediate structures is broadly known as stream fusion.
According to a survey of stream processing, ``Within computer science the term stream has been attributed to P.J. Landin, formulated during the development of operational constructs presented as part of his work on the correspondence between ALGOL 60 and the lambda-calculus. Indeed, we note that P.J. Landin's original use for streams was to model the histories of loop variables, but he also observed that streams could have been used as a model for I/O in ALGOL 60.''
R. Stephens: A Survey Of Stream Processing
Acta Informatica, July 1997, Volume 34, Issue 7, pp 491–541
Stream Fusion, to Completeness
The Related Work section of the paper
zipoperator and are still an order of magnitude slower than hand-written loops.
We present the first approach that represents the full generality of stream processing and eliminates overheads, via the use of staging. It is based on an unusually rich semantic model of stream interaction. We support any combination of zipping, nesting (or flat-mapping), sub-ranging, filtering, mapping --of finite or infinite streams. Our model captures idiosyncrasies that a programmer uses in optimizing stream pipelines, such as rate differences and the choice of a ``for'' vs. ``while'' loops. Our approach delivers hand-written--like code, but automatically. It explicitly avoids the reliance on black-box optimizers and sufficiently-smart compilers, offering highest, guaranteed and portable performance.
Our approach relies on high-level concepts that are then readily mapped into an implementation. Accordingly, we have two distinct implementations: an OCaml stream library, staged via MetaOCaml, and a Scala library for the JVM, staged via LMS. In both cases, we derive libraries richer and simultaneously many tens of times faster than past work. We greatly exceed in performance the standard stream libraries available in Java, Scala and OCaml, including the well-optimized Java 8 streams.
Joint work with Aggelos Biboudis, Nick Palladinos and Yannis Smaragdakis.
The complete code for both MetaOCaml and Scala/Java versions of the strymonas library, and the complete code for all benchmarks. The code received the Artifact Evaluated badge from the POPL 2017 artifact evaluation committee.
Functional stream libraries let us easily build stream processing
pipelines, by composing sequences of simple transformers such as
filter with producers (backed by an array, a
file, or a generating function) and consumers (reducers). The purely
applicative approach of building a complex pipeline from simple
immutable pieces simplifies programming and reasoning: the assembled
pipeline is an executable specification. To be practical, however, a
library has to be efficient: at the very least, it should avoid
creating intermediate structures -- especially structures like
files and lists whose size grows with the length of the stream. Even
the bounded-size intermediate structures significantly, up to two
orders of magnitude, slow down the processing. Eliminating the
intermediate structures is the central problem in stream processing:
so-called stream fusion.
Stream fusion has been the subject of intensive research since late 1950's. By now, the low-hanging fruit in stream processing has been all picked up -- although some of it quite recently, POPL 2017. Stream fusion made it to the front page of CACM (May 2017). Java 8 Streams and Haskell compilers, among others, have implemented some of the earlier research results. We have attained a milestone. What are the further challenges?
That was the topic of many discussions at the workshop. Several main questions have come up over and over again:
As a tangible outcome, the meeting has identified a set of problems -- challenges -- to help drive and evaluate further research: filterMax, sorted merge, multiple appends, parallel merge. The report of the meeting discusses them in detail.
The workshop is organized together with Aggelos Biboudis and Martin Odersky. The Shonan seminar series is sponsored by Japan's National Institute of Informatics (NII).
The final report
A stream-wise access to a collection is an important access method, which may even be supported by hardware. For example, Pentium III floating-point extension (Internet Streaming SIMD Extension) lets programmers designate arrays as streams and provides instructions to handle such data efficiently (Internet Streaming SIMD Extensions, Shreekant (Ticky) Thakkar and Tom Huff, Computer, Vol. 32, No. 12, December 1999, pp. 26-34). Streaming is a typical memory access model of DSPs: that's why DSP almost never incorporate a data cache (See ``DSP Processors Hit the Mainstream'', Jennifer Eyre and Jeff Bier, Computer, Vol. 31, No. 8, August 1998, pp. 51-59). A memory architecture designed in the article ``Smarter Memory: Improving Bandwidth for Streamed References'' (IEEE Computer, July 1998, pp.54-63) achieves low overall latencies because the CPU is told by a compiler that a stream operation is to follow. LinAlg offers this streaming access model to an application programmer.
Matrix streams may stride a matrix by an arbitrary amount. This lets us traverse a matrix along the diagonal, by columns, by rows, etc. Streams can be constructed of a Matrix itself, or from other matrix views (MatrixColumn, MatrixRow, MatrixDiag). In the latter case, the streams are confined only to the specific portions of the matrix.
Many functions of LinAlg are written in terms of streams: for example, the computation of vector norms, the addition of a vector to the diagonal or the anti-diagonal of a matrix, Aitken-Lagrange interpolation. Singular value decomposition SVD demonstrates many applications of streams: e.g., multiplying a matrix by a rotation matrix avoids random access to matrix elements and the corresponding range checks and offset calculations. The stream code is also more lucid. One may create a stream that spans over a part of another stream. We use substreams, for example, to efficiently reflect the upper triangle of a square matrix onto the lower one, yielding a symmetric matrix. The SVD computation uses subranging extensively, e.g., for left Householder transformations.
LinAlg's streams may span an arbitrary rectangular block of a matrix, including the whole matrix, a single matrix element, a matrix row or a column, or a part thereof. Assigning a block of one matrix to a block of another takes only one line -- which, due to inlining, is just as efficient as the direct loop with pointer manipulation.
taketransform lists, we transform folds. We implement the range of progressively more complex transformers -- from
dropWhile, and finally,
zipWith. The standard list API is also provided.
Emphatically we never convert a stream to a list and so we do not use recursion or recursive types. All iterative processing is driven by the fold itself. Therefore, all operations of the library are assuredly terminating. We only need higher-ranked types, because lists cannot be fully implemented in simply typed lambda-calculus.
The implementation of zip also solves the problem of ``parallel loops''. One can think of a fold as an accumulating loop and realize a nested loop as a nested fold. Representing a parallel loop as a fold is a challenge, answered at the end of the article. This becomes especially interesting in the case of general backtracking computations, or backtracking computations in direct style, with delimited continuations modeling `lists'.
Beyond Church encoding: Boehm-Berarducci isomorphism of algebraic data types and polymorphic lambda-terms
LogicT: backtracking monad transformer with fair operations and pruning
which illustrates the close connection with foldr/build list-fusion, aka ``short-cut deforestation''.
Predecessor and lists are not representable in
simply-typed lambda calculus
Therefore, higher-rank or recursive/inductive types are necessary for lists
Parallel composition of streams: several sources to one sink
Folding over multiple streams using monad transformers