shell/AWK/Perl-like scripting in OCaml

 

 

Introduction

The library myawk is a small OCaml library for quick, rough and ready scripts. It is meant as a lightweight and less idiosyncratic replacement for AWK (let alone Perl) -- and also as an exploration and demonstration of a better alternative to shell pipes.

As its name implies, it is a personal project, for personal scripting needs. A bit of motivation may still be in order. Why scripting in OCaml?

The drawbacks of shell scripts are well-known: late reporting of errors, numerous edge cases (of substitutions, escaping, IFS) often with security implications, and waste of system resources. Less criticized are pipes, which are regarded as one of shell's and Unix's greatest strengths. Building a long hose, so to speak, by plugging in simple segments is indeed a commendable strength -- one that diminishes when the segments are not simple and need tuning in their own idiosyncratic ways. The lack of consistency among Unix command-line tools was noted a long time ago, and it keeps getting worse as the number of flags grows. Pipes as a concept also deserve criticism.

One of the motivations for myawk is to see if an expressive streaming library could, in fact, be pleasant to use in everyday scripting. We introduce myawk pipelines below, contrasting them with shell pipes and thus pinpointing the limitations of the latter. The rest of this page shows off half a dozen myawk scripts. Some of them are rewritten from shell or AWK, to compare with the classical scripting.

OCaml is a somewhat verbose language. Therefore, myawk scripts often (but not always!) have more characters than the corresponding Perl or shell script, albeit with better performance. One does not have to type all these characters by hand, however. Scripts in myawk are meant to be entered in a text editor, where one may take full advantage of code completion, advanced editing, documentation and other facilities of modern editors.

As an aside, the command line, as an artifact of teletypes and line-oriented processing of early computers, is one of the most limited text user interfaces. Lisp Machines, Acme, language modes of Emacs and other editors, and, recently, Proof General and CoqIDE showed that one may interact with a computing system -- execute commands, evaluate expressions, adjust and re-execute -- within the full editor, never having to drop to a `command line'.

References
myawk.ml [3K]
The myawk core

strings.ml [5K]
The other file of the myawk library: frequently occurring string operations not found in Stdlib

Makefile [<1K]
How to build the library

Stream processing

 

Pipelines in myawk, contrasted to shell pipes

When it comes to shell pipes, more often than not one finds praise rather than criticism. Yet there are plenty of drawbacks. Some are obvious: a pipe transmits an unstructured, untyped stream, which the recipient has to parse and validate. PowerShell, among others, recognizes and addresses this point. There are deeper problems, however, which are rarely mentioned. To see them, let's look at myawk pipelines and compare.

The simplest, the `identity' pipeline is

    for_each_line @@ print_endline
The producer for_each_line parses the standard input into lines and sends each line (without the line terminator) to the consumer, print_endline, which prints the line on standard output, adding a newline. This myawk pipeline looks and feels like an ordinary shell pipe: the `process' for_each_line produces lines, and the `process' print_endline reads them across the @@ `pipe'. The only difference from shell is that the pipe transmits structured data: lines.

No processes are launched in myawk, however: for_each_line is an ordinary function, and print_endline is a function from the standard library. The `pipe' operator @@ also comes from the standard library: it is function application, at a lower precedence than juxtaposition. In the present example the parsing precedence is irrelevant and the operator can be dropped: for_each_line print_endline. All that happens is for_each_line applying its argument print_endline to each line it reads from the standard input.
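
To make the mechanics concrete, here is a minimal sketch of what for_each_line might look like -- an illustrative assumption rather than the actual code of myawk.ml:

    (* A sketch of for_each_line: read the standard input (or, with the
       optional ~fname argument used in later examples, a named file) line by
       line and apply the consumer to each line. *)
    let for_each_line ?fname (consumer : string -> unit) : unit =
      let ch = match fname with Some f -> open_in f | None -> stdin in
      (try while true do consumer (input_line ch) done
       with End_of_file -> ());
      if fname <> None then close_in ch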

A more interesting pipeline, although also essentially the identity, splits lines into fields, using whitespace as the field separator:

    for_each_line @@ map words @@ print
Here, words (like the eponymous function in Haskell) splits a string of whitespace-separated words into a list of words. The function print prints the list of words, using space as the (default) separator and adding a newline at the end. This pipeline can also be interpreted as a shell pipe: between the line producer process for_each_line and the ultimate consumer print we wedge the process map words, which repeatedly reads lines from for_each_line and splits them, writing the resulting list of fields into the pipe to print. Again, the only apparent difference from shell is that the pipes are structured; in fact, more structured than before: the pipe to print has a list of words as its transmission unit.

In reality, map is no process at all: it is a mere function composition (with swapped arguments). The pipeline could have been written simply, albeit less elegantly, as:

    for_each_line (print <|> words)
where <|> is functional composition. In this `pipeline', for_each_line passes each line from stdin to the composition of print and words. The role of map is hence merely to arrange the individual processors in a more visually pleasing order, highlighting the left-to-right flow of data.
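
For concreteness, here are plausible definitions of the combinators seen so far. They are illustrative assumptions; the actual definitions in myawk.ml and strings.ml may well differ (for one, the real words presumably splits on arbitrary whitespace rather than just the space character):

    (* <|> is right-to-left function composition; map composes a transformer
       with a consumer, with the arguments swapped so that the data flow reads
       left to right. *)
    let ( <|> ) f g = fun x -> f (g x)
    let map (f : 'a -> 'b) (consumer : 'b -> 'c) : 'a -> 'c = consumer <|> f
    (* words: split a line into fields (simplified here to splitting on the
       space character); print: join the fields with a separator, space by
       default, and append a newline. *)
    let words (s : string) : string list =
      String.split_on_char ' ' s |> List.filter (fun w -> w <> "")
    let print ?(ch = stdout) ?(sep = " ") (fields : string list) : unit =
      output_string ch (String.concat sep fields);
      output_string ch "\n"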

One may look at map from yet another angle: keeping in mind that @@ is right associative, the original pipeline is in fact

    for_each_line (map words print)
which may be read as `adjusting' the consumer of fields print to consume lines instead. Here words does the adjustment, splitting a line into fields.

Once we have taken the trouble of splitting a line into fields, we can do something with them, say, extract the first field for further processing.

    for_each_line @@ 
    map words @@ 
    map_option (function (x::_) -> Some x | _ -> None) @@
    print_endline
Here map_option is another `adjuster' of a consumer to a producer, transforming the produced data to fit the consumer. Only now the transformation is allowed to fail, in which case the produced data is disregarded. In our example, empty lines, which have no fields, are skipped.
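
A plausible definition of map_option, again an illustrative assumption rather than the library's actual code:

    (* map_option: apply the partial transformation and feed the consumer only
       when it succeeds; otherwise drop the datum. *)
    let map_option (f : 'a -> 'b option) (consumer : 'b -> unit) : 'a -> unit =
      fun x -> match f x with Some y -> consumer y | None -> ()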

Why do we now use print_endline whereas before it was print? The reader is encouraged to try using print and see what happens. Indeed, there are actually types and type checking: print consumes a list of strings, whereas the extracted first field is a single string, for which print_endline is the fitting consumer.

Having extracted the first field, we may parse it, say, to an integer:

    for_each_line @@ 
    map words @@ 
    map_option (function (x::_) -> Some x | _ -> None) @@
    map_option int_of_string_opt @@
    Printf.printf "The field is %d.\n"
The two consecutive map_options may, of course, be fused -- which eliminates some constant overhead. In either case, the overall processing, just as with shell pipelines, is incremental: for_each_line reads a line, which is then split into fields; the first field is extracted, converted to an integer and printed. If any conversion fails, the line is disregarded. Only after the line is fully consumed does for_each_line try to read another.
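
For instance, the fused version could read:

    for_each_line @@ 
    map words @@ 
    map_option (function (x::_) -> int_of_string_opt x | _ -> None) @@
    Printf.printf "The field is %d.\n"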

All the previous myawk pipelines could be interpreted as shell pipes (with typed pipes), albeit with a loss of performance. Now we come to pipelines that cannot be viewed as shell pipes, and which are very difficult to write in shell. Suppose that we want to amend the previous pipeline to print not just the extracted field, converted to an integer, but also the original line in which that field appeared. It is easy to do if we name the line produced by for_each_line, for later use:

    for_each_line @@ fun l -> l |>
    map words @@ 
    map_option (function (x::_) -> Some x | _ -> None) @@
    map_option int_of_string_opt @@
    fun x -> Printf.printf "The field is %d and the line is `%s'\n" x l
Here, |> from the standard library is yet another application operator, with swapped arguments. (In fact, x |> f is the same as f @@ x, which is the same as f x). In myawk, we can introduce names any time we need to refer to some intermediate result.

The final example is a truly non-linear pipeline. It is a modification of the earlier one: rather than dropping the lines whose first field is not parseable as an int, it logs them:

    for_each_line @@ fun l -> l |>
    map words @@ 
    map_option (function (x::_) -> Some x | _ -> None) @@
    cases [
    optthen int_of_string_opt 
     (map (fun x -> x*x) @@ 
      fun x -> Printf.printf "The squared field is %d, the line is `%s'\n" x l);
    fun x -> Some
        (Printf.eprintf "The first field `%s' in the line `%s' isn't int\n" x l)
    ]
As the name suggests, cases does the case analysis, routing the incoming data to the first alternative that succeeds in transforming them. In our example, if a field can be parsed to an integer, the integer is passed to the consumer for squaring and printing. If the field cannot be parsed, it is passed to the second alternative.
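
To make the types concrete, here are plausible sketches of optthen and cases, inferred from their use above -- assumptions for illustration, not necessarily the definitions in myawk.ml:

    (* optthen pairs a partial transformation with a consumer, reporting
       success as Some; cases tries the alternatives in order and stops at the
       first one that succeeds. *)
    let optthen (f : 'a -> 'b option) (consumer : 'b -> 'c) : 'a -> 'c option =
      fun x -> Option.map consumer (f x)
    let rec cases (alternatives : ('a -> 'b option) list) (x : 'a) : unit =
      match alternatives with
      | []          -> ()
      | alt :: rest -> (match alt x with Some _ -> () | None -> cases rest x)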
References
pipes.ml [3K]
pipes_test.txt [<1K]
The complete code for the examples and the input file for them

shell pipes aren't that great

Stream processing

 

Typical AWK-like processing

The library myawk was meant for AWK-like processing such as the following, which is similar to the examples explained in the tutorial above. Below is the complete code:
    #!/bin/env -S ocaml
    
    #load "myawk.cma"
    open Myawk open Strings
    let hash = string_of_int <|> Hashtbl.hash
    ;;
    (* Sanitize the files originally used by the example further below.
       The files are made of space-separated fields; the first field is the
       key. It is sensitive; but because it is a key it can't be replaced with
       meaningless garbage. We obfuscate it beyond recognition. The third field 
       is obfuscated as well. The second and fourth can be left as they are,
       and the fifth, if present, is replaced with XXX
    
       The script is a proper filter: reads from stdin, writes to stdout
     *)
    
    for_each_line @@ map words @@ function (f1::f2::f3::f4::rest) ->
      print [hash f1; f2; hash f3; f4; if rest = [] then "" else "XXX"]
    ;;

 

Database join

One of the first real applications of myawk was a script to perform a join on two text files -- what one could write in SQL as
    SELECT T2.* from Table1 as T1, Table2 as T2 where T1.f1 = T2.f1
where Table1 and Table2 are actually text files with space-separated column values. Here is the first version of the script (Table1 is supposed to be fed to stdin):
    for_each_line @@ map words @@ 
    map_option (function (x::_) -> Some x | _ -> None) @@
    (ignore <|> shell "grep %s table1.txt")

It is a typical rough-and-dirty script. Alas, it was too rough: I was so excited that it typechecked and worked the first time that I did not look carefully at the output and overlooked what I was looking for (resulting in unneeded hassle and an apology). I should have queried exactly for what I wanted:

    SELECT T1.f1, T1.f4 FROM Table1 as T1, Table2 as T2
    WHERE T1.f1 = T2.f1 AND T1.f3 <> "3"

which is actually easy to write in myawk (probably not so in AWK though)

    for_each_line ~fname:"table2.txt" @@ map words @@ 
    map_option (function (w::_) -> Some w | _ -> None) @@
    fun w -> 
      for_each_line ~fname:"table1.txt" @@  map words @@
      map_option (function 
       (x::f2::f3::f4::_) when x = w && f4 <> "3" -> Some [x;f4] | _ -> None) @@ 
      print
This is the classical nested loop join.
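
The first version above relies on the helper shell. Here is a plausible sketch of it -- an assumption for illustration; the real helper presumably also quotes the interpolated arguments:

    (* shell: a printf-like function that builds a command line and runs it,
       returning the exit status. *)
    let shell fmt = Printf.ksprintf Sys.command fmt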
References
join1.ml [<1K]
The first, too rough-and-ready version

join2.ml [<1K]
An honest nested-loop join: producing just what I was looking for

table1.txt [3K]
table2.txt [<1K]
The `database' files for join?.ml

 

Case-branching

The pipelines in myawk do not have to be linear: dataflow may split into several branches, routed by guards. The motivation came from the following script to partition a file into two:
    cat ./bench_result.txt | grep baseline >| bench_baseline.txt && \
    cat ./bench_result.txt | grep staged   >| bench_staged.txt

Here is its rendition in myawk, which reads the input file only once. Moreover, it prints the lines that contain neither `baseline' nor `staged' on the standard output; it hence splits the input three-way.

    let () =
      with_output_file "bench_baseline.txt" @@ fun chbase ->
      with_output_file "bench_staged.txt" @@ fun chstaged ->
      for_each_line ~fname:"bench_result.txt" @@ fun l ->
      l |> cases [
       optthen (strstr "baseline") (fun _ -> print ~ch:chbase [l]);
       optthen (strstr "staged")   (fun _ -> print ~ch:chstaged [l]);
       fun l -> Some (print_endline l)      (* default case *)
      ]
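
The script uses with_output_file and strstr. Here are plausible sketches of both, inferred from their use above -- assumptions for illustration, not necessarily the actual code of myawk.ml or strings.ml:

    (* with_output_file: open a file for writing, pass the channel to the body
       and close the channel when the body returns (or raises). *)
    let with_output_file (fname : string) (body : out_channel -> 'a) : 'a =
      let ch = open_out fname in
      Fun.protect ~finally:(fun () -> close_out ch) (fun () -> body ch)
    (* strstr: search for a substring, returning the position of its first
       occurrence on success. *)
    let strstr (needle : string) (haystack : string) : int option =
      let nlen = String.length needle and hlen = String.length haystack in
      let rec go i =
        if i + nlen > hlen then None
        else if String.sub haystack i nlen = needle then Some i
        else go (i + 1)
      in go 0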
References
partition.ml [<1K]
The complete script

 

Stateful AWK script

The prototype for the present example was an AWK script in the OCaml distribution, stdlib/remove_module_aliases.awk, whose contents are shown below:
    BEGIN { in_aliases=0 }
    NR == 1 { printf ("# 1 \"%s\"\n", FILENAME) }
    /^\(\*MODULE_ALIASES\*\)\r?$/ { in_aliases=1 }
    !in_aliases { print }

The stateful processing is realized in myawk as follows:

    let filename = Sys.argv.(1)
    let in_aliases = ref false
    let () = Printf.printf "# 1 \"%s\"\n" filename
    let () = 
      for_each_line ~fname:filename @@ fun l ->
        if is_prefix "(*MODULE_ALIASES*)" l <> None then in_aliases := true;
        if not !in_aliases then print_endline l
There is no need for the idiosyncratic BEGIN {} and the record counter NR.
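
The script relies on is_prefix from strings.ml. A plausible sketch of it, inferred from its use above (an assumption for illustration):

    (* is_prefix: if the string s starts with the prefix p, return Some of the
       remainder of s; otherwise None. *)
    let is_prefix (p : string) (s : string) : string option =
      let plen = String.length p in
      if String.length s >= plen && String.sub s 0 plen = p
      then Some (String.sub s plen (String.length s - plen))
      else None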
References
remove_module_aliases.ml [<1K]
The complete script

 

shell pipes aren't that great

The example below does not show off myawk well, but it does show what there is to decry about shell pipes. One can do better in almost any other language.

The original shell script is a sample GIT commit hook, .git/hooks/commit-msg.sample found in any GIT repository. Its comments explain:

Called by "git commit" with one argument, the name of the file that has the commit message. The hook should exit with non-zero status after issuing an appropriate message if it wants to stop the commit. The hook is allowed to edit the commit message file.

This example catches duplicate Signed-off-by lines.

Here is the shell script itself

    test "" = "$(grep '^Signed-off-by: ' "$1" |
             sort | uniq -c | sed -e '/^[ 	]*1[ 	]/d')" || {
            echo >&2 Duplicate Signed-off-by lines.
            exit 1
    }

Looking at the original shell script brings despair. Not only is the script algorithmically ugly: if a duplicate signed-off line occurs near the beginning, we should be able to report it right away and stop; there is no need to read the rest of the commit message, filter it, sort it, precisely count all duplicates and filter again. Not only does the script gratuitously waste system resources by launching many processes and allocating communication buffers. Mainly, the script is not good for its primary purpose: it is not easy to write and read. Pipeline composition of small stream processors is generally a good thing -- but not when each stream processor is written in its own idiosyncratic language. (For example: although one may have a general idea about uniq, what does uniq -c do? Understanding the sed phrase requires reading yet another man page.) Incidentally, I have doubts about the script: I think the quotes around $1 are meant to be embedded; but why are they not escaped then? Probably it is some edge case of bash, one of several thousand.

In contrast, the OCaml script below does exactly what is required, with no extra work -- and can be understood without reading many man pages.

    module H = Hashtbl
  
    let commit_msg = Sys.argv.(1)
    let ht = H.create 5
    let () =
      for_each_line ~fname:commit_msg @@ fun l ->
      if is_prefix "Signed-off-by: " l <> None then begin
        if H.find_opt ht l <> None then begin
          prerr_endline "Duplicate Signed-off-by lines.";
          exit 1
        end else
          H.add ht l ()
      end
References
commit_msg.ml [<1K]
The complete script