myawk scripts
myawk
is a small OCaml library for quick, rough and
ready scripts. It is meant as a lightweight and less idiosyncratic
replacement for AWK (let alone Perl) -- and also as an exploration and
demonstration of a better alternative to shell pipes.
As its name implies, it is a personal project, for personal scripting needs. A bit of motivation may still be in order. Why scripting in OCaml?
The drawbacks of shell scripts are well-known: late reporting of
errors, numerous edge cases (of substitutions, escaping, IFS) often
with security implications, waste of system resources. Less criticized
are pipes, which are regarded as shell's and Unix's greatest
strengths. Building a long hose, so to speak, by plugging in simple
segments is the commendable strength -- which diminishes when the
segments are not simple and need tuning in their own idiosyncratic
ways. The lack of consistency among Unix command-line tools was noted
a long time ago, and is getting worse as the number of flags keeps
increasing. Pipes as a concept also deserve criticism.
One of the motivations for myawk is to see if an expressive streaming
library could, in fact, be pleasant to use in everyday
scripting. We introduce myawk pipelines below, contrasting them with
shell pipes and thus pinpointing the limitations of the latter. The
rest of this page shows off half a dozen myawk scripts. Some of
them are re-written from shell or AWK, to compare with the classical
scripting.
OCaml is a somewhat verbose language. Therefore, myawk scripts often (but
not always!) have more characters than the corresponding Perl or shell
script, although with better performance. One does not have
to type in all these characters, however. Scripts in myawk are meant to be
entered in a text editor, where one may take full advantage of
code completion, advanced editing, documentation and other
facilities of modern editors.
As an aside, the command line, as an artifact of teletypes and line-oriented processing of early computers, is one of the most limited text user interfaces. Lisp Machines, Acme, language modes of Emacs and other editors, and, recently, Proof General and CoqIDE showed that one may interact with a computing system -- execute commands, evaluate expressions, adjust and re-execute -- within the full editor, never having to drop to a `command line'.
The myawk core
strings.ml [5K]
The other file of the myawk library:
frequently occurring string operations, not found in Stdlib
Makefile [<1K]
How to build the library
Let us now build myawk pipelines and compare them with shell pipes.
The simplest, the `identity' pipeline is
for_each_line @@ print_endline
The producer
for_each_line
parses the standard input into lines and
sends each line (without the line terminator) to the consumer,
print_endline
, which prints the line on standard output, adding the
newline. This myawk pipeline looks and feels like an ordinary shell
pipe: the `process' for_each_line produces lines, which the print_endline
`process' reads across the @@ `pipe'. The only difference from
shell is that the pipe transmits structured data: lines.
No processes are launched in myawk
however: for_each_line
is an
ordinary function, print_endline
is the standard library
function. The `pipe' operator @@
also comes from the standard
library: it is function application, of lower precedence
than juxtaposition. In the present example, the parsing
precedence is irrelevant and the operator can be dropped:
for_each_line print_endline. All that happens is
for_each_line
applying its argument print_endline
to each line it
reads from the standard input.
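To make the `ordinary function' point concrete, here is a sketch of what
for_each_line might look like. The signature, including the optional ~fname
argument, is an assumption inferred from its uses below; the library's
actual code may differ.

let for_each_line ?fname (consumer : string -> unit) : unit =
  let ch = match fname with Some f -> open_in f | None -> stdin in
  let rec loop () =
    match input_line ch with
    | line -> consumer line; loop ()          (* feed each line to the consumer *)
    | exception End_of_file -> if fname <> None then close_in ch
  in loop ()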
A more interesting, although also essentially the identity pipeline splits lines to fields, using whitespace as the field separator:
for_each_line @@ map words @@ print
Here,
words
(as the eponymous function in Haskell) splits a string
of whitespace-separated words into the list of words. The function
print
prints the list of words, using space as the (default)
separator, adding newline at the end. This pipeline can also be
interpreted as a shell pipe: between the line producer process
for_each_line
and the ultimate consumer print
we wedge the
process map words
that repeatedly
reads lines from for_each_line
and splits them,
writing the resulting list of fields into the pipe to
print
. Again, the only apparent difference from shell is that the
pipes are structured; in fact, more structured than before: the
pipe to print
has as its transmission unit a list of words.
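For concreteness, words could be written along the following lines. This is
a sketch only; the library's Strings module may treat whitespace differently.

let words (s : string) : string list =
  String.split_on_char ' ' s
  |> List.concat_map (String.split_on_char '\t')
  |> List.filter (fun w -> w <> "")            (* drop empty fields *)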
In reality, map
is no process at all: it is a mere function
composition (with swapped arguments). The pipeline could have been
written simply, albeit less elegantly, as:
for_each_line (print <|> words)
where
<|>
is the functional composition. In this `pipeline'
for_each_line
passes each line from stdin to the composition
of print
and words
. The role of map is hence merely to arrange
individual processors in a more visually pleasing order, highlighting
the left-to-right flow of data.
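Under this reading, plausible definitions of <|> and map are the following
sketch -- assumptions about the library, shown only to fix the intuition:

let ( <|> ) g f = fun x -> g (f x)     (* (print <|> words) l = print (words l) *)
let map f consumer = consumer <|> f    (* map words print = print <|> words     *)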
One may look at map
from yet another angle: keeping in mind that
@@
is right associative, the original pipeline is in fact
for_each_line (map words print)
which may be read as `adjusting' the consumer of fields
print
to consume lines instead. Here words
does the adjustment,
splitting a line into fields.
Once we have taken the trouble of splitting a line into fields, we can do something with them, say, extract the first field for further processing.
for_each_line @@ map words @@
  map_option (function (x::_) -> Some x | _ -> None) @@ print_endline
Here
map_option
is another `adjuster' of a consumer to a producer,
transforming the produced data to fit the consumer. Only now the
transformation is allowed to fail, in which case the produced data is
disregarded. In our example, empty lines, which have no fields, are
skipped.
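A plausible definition of map_option, as a sketch of its semantics only
(the library's actual code may differ):

let map_option (f : 'a -> 'b option) (consumer : 'b -> unit) : 'a -> unit =
  fun x -> match f x with
  | Some y -> consumer y       (* transformation succeeded: pass the result on *)
  | None   -> ()               (* transformation failed: the datum is dropped  *)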
Why do we now use print_endline
whereas before it was print
? The
reader is encouraged to try using print
and see what happens.
Indeed, there are actually types and type checking.
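To spell out the types at play, here is a sketch of the interface assumed in
these examples; the library's actual signatures, in particular the optional
arguments of print, may differ.

module type MYAWK_SKETCH = sig
  val for_each_line : ?fname:string -> (string -> unit) -> unit
  val words         : string -> string list
  val map           : ('a -> 'b) -> ('b -> unit) -> ('a -> unit)
  val map_option    : ('a -> 'b option) -> ('b -> unit) -> ('a -> unit)
  val print         : ?ch:out_channel -> ?sep:string -> string list -> unit
end

After extracting the first field, the consumer receives a single string;
print_endline : string -> unit fits, whereas print expects a string list,
so the pipeline with print is rejected by the type checker.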
Having extracted the first field, we may parse it, say, to an integer:
for_each_line @@ map words @@
  map_option (function (x::_) -> Some x | _ -> None) @@
  map_option int_of_string_opt @@
  Printf.printf "The field is %d.\n"
The two consecutive
map_option
may, of course, be fused -- which
eliminates some constant overhead. In either case, the overall
processing, just as with shell pipelines, is incremental:
for_each_line
reads a line, which is then split into fields, the
first field extracted, converted to an integer and printed. If any
conversion fails, the line is disregarded. It is after the line is fully
consumed that for_each_line
tries to read another.
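For instance, the fused variant might read (same behavior, one adjuster
instead of two):

for_each_line @@ map words @@
  map_option (function (x::_) -> int_of_string_opt x | _ -> None) @@
  Printf.printf "The field is %d.\n"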
All the previous myawk
pipelines could be interpreted as shell pipes
(with typed pipes), with a loss of performance. Now we come to
pipelines that cannot be viewed as shell pipes, and which are very
difficult to write in shell. Suppose that we want to amend the
previous pipeline to print not just the extracted field, converted to
an integer, but also the original line in which that field appeared.
It is easy to do if we name a line produced by
for_each_line
, for later use:
for_each_line @@ fun l -> l |>
  map words @@
  map_option (function (x::_) -> Some x | _ -> None) @@
  map_option int_of_string_opt @@
  fun x -> Printf.printf "The field is %d and the line is `%s'\n" x l
Here,
|>
from the standard library is yet another application
operator, with swapped arguments. (In fact, x |> f
is the same as f @@ x
, which is the same as f x
). In myawk
, we
can introduce names any time we need to refer to some
intermediate result.
The final example is a truly non-linear pipeline. It is a modification to the earlier one, to log the lines whose first field is not parseable as an int, rather than dropping them:
for_each_line @@ fun l -> l |>
  map words @@
  map_option (function (x::_) -> Some x | _ -> None) @@
  cases [
    optthen int_of_string_opt
      (map (fun x -> x*x) @@ fun x ->
         Printf.printf "The squared field is %d, the line is `%s'\n" x l);
    fun x -> Some (Printf.eprintf
         "The first field `%s' in the line `%s' isn't int\n" x l)
  ]
As the name suggests,
cases
does the case analysis, routing the incoming
data to the first alternative that succeeds in transforming them.
In our example, if a field can be parsed to an integer, the
integer is passed to the consumer for squaring and printing. If the
field cannot be parsed, it is passed to the second
alternative.
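The following sketch shows one way cases and optthen could be defined, merely
to pin down the semantics used above; these are assumptions, not the
library's actual code.

(* Each alternative is a possibly-failing consumer 'a -> 'b option;
   cases tries the alternatives in order until one succeeds. *)
let cases (alts : ('a -> 'b option) list) : 'a -> unit =
  fun x -> ignore (List.find_map (fun alt -> alt x) alts)

(* optthen p k: if the parser p succeeds, feed the result to the consumer k
   and report success; otherwise report failure so cases tries the next one. *)
let optthen (p : 'a -> 'b option) (k : 'b -> 'c) : 'a -> 'c option =
  fun x -> Option.map k (p x)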
myawk
was meant for AWK-like processing, like the
following. It is similar to the examples explained in the tutorial
above. Below is the complete code:
#!/bin/env -S ocaml
#load "myawk.cma"

open Myawk
open Strings

let hash = string_of_int <|> Hashtbl.hash
;;

(* Sanitize the files originally used by the example further below.
   The files are made of space-separated fields; the first field is the key.
   It is sensitive; but because it is a key it can't be replaced with
   meaningless garbage. We obfuscate it beyond recognition.
   The third field is obfuscated as well. The second and fourth can be left
   as they are, and the fifth, if present, is replaced with XXX
   The script is a proper filter: reads from stdin, writes to stdout
*)
for_each_line @@ map words @@ function (f1::f2::f3::f4::rest) ->
  print [hash f1; f2; hash f3; f4; if rest = [] then "" else "XXX"]
;;
The first use of myawk was a script to perform
a join on two text files -- what one could write in SQL as
SELECT T2.* from Table1 as T1, Table2 as T2 where T1.f1 = T2.f1
where
Table1
and Table2
are actually text files with space-separated column
values. Here is the first version of the script (Table1
is supposed
to be fed to stdin):
for_each_line @@ map words @@
  map_option (function (x::_) -> Some x | _ -> None) @@
  (ignore <|> shell "grep %s table1.txt")
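Here shell formats a command with the produced field and runs it; one may
think of it roughly as the following sketch. This is an assumption: the
actual library function may well quote its argument more carefully.

let shell fmt = Printf.ksprintf Sys.command fmt
(* shell "grep %s table1.txt" x  runs grep with x spliced in and returns
   the exit code, which ignore <|> ... discards *)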
It is a typical rough-and-dirty script. Alas, it was too rough: I was so excited that it typechecked and worked the first time, that I did not look carefully at the output and overlooked what I was looking for (resulting in an unneeded hassle and apology). I should have queried exactly for what I wanted:
SELECT T1.f1, T1.f4 FROM Table1 as T1, Table2 as T2 WHERE T1.f1 = T2.f1 AND T1.f3 <> "3"
which is actually easy to write in myawk
(probably not so in AWK though)
for_each_line ~fname:"table2.txt" @@ map words @@ map_option (function (w::_) -> Some w | _ -> None) @@ fun w -> for_each_line ~fname:"table1.txt" @@ map words @@ map_option (function (x::f2::f3::f4::_) when x = w && f4 <> "3" -> Some [x;f4] | _ -> None) @@ printThis is the classical nested loop join.
join2.ml [<1K]
An honest nested-loop join: producing just
what I was looking for
table1.txt [3K]
table2.txt [<1K]
The `database' files for join?.ml
myawk pipelines do not have to be linear: the dataflow may split
into several branches, routed by guards.
The motivation came from the following script to partition a file
into two:
cat ./bench_result.txt | grep baseline >| bench_baseline.txt && \
cat ./bench_result.txt | grep staged >| bench_staged.txt
Here is its rendition in myawk
, which reads the input file only
once. Also, it prints the lines which contain neither `baseline' nor `staged'
on the standard output. It hence splits the input three-way.
let () =
  with_output_file "bench_baseline.txt" @@ fun chbase ->
  with_output_file "bench_staged.txt" @@ fun chstaged ->
  for_each_line ~fname:"bench_result.txt" @@ fun l ->
  l |> cases [
    optthen (strstr "baseline") (fun _ -> print ~ch:chbase [l]);
    optthen (strstr "staged") (fun _ -> print ~ch:chstaged [l]);
    fun l -> Some (print_endline l)  (* default case *)
  ]
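The two helpers used above are assumed to behave as in the following sketch:
with_output_file opens a file, passes the channel to its body and closes it
afterwards; strstr pat s succeeds when pat occurs in s. The actual library
functions may differ.

let with_output_file (fname : string) (body : out_channel -> 'a) : 'a =
  let ch = open_out fname in
  Fun.protect ~finally:(fun () -> close_out ch) (fun () -> body ch)

let strstr (pat : string) (s : string) : int option =
  let n = String.length pat and m = String.length s in
  let rec go i =
    if i + n > m then None
    else if String.sub s i n = pat then Some i      (* found at position i *)
    else go (i + 1)
  in go 0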
Another example: the AWK script
stdlib/remove_module_aliases.awk,
whose contents are shown below
BEGIN { in_aliases=0 }
NR == 1 { printf ("# 1 \"%s\"\n", FILENAME) }
/^\(\*MODULE_ALIASES\*\)\r?$/ { in_aliases=1 }
!in_aliases { print }
The stateful processing is realized in myawk
as follows:
let filename = Sys.argv.(1)
let in_aliases = ref false
let () = Printf.printf "# 1 \"%s\"\n" filename
let () = for_each_line ~fname:filename @@ fun l ->
  if is_prefix "(*MODULE_ALIASES*)" l <> None then in_aliases := true;
  if not !in_aliases then print_endline l
There is no need for the idiosyncratic
BEGIN {}
and the
record counter NR
.
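The helper is_prefix, judging by its use above, returns an option: Some when
the line begins with the given prefix, None otherwise. A minimal sketch under
that assumption:

let is_prefix (prefix : string) (s : string) : string option =
  let n = String.length prefix in
  if String.length s >= n && String.sub s 0 n = prefix
  then Some (String.sub s n (String.length s - n))   (* the remainder of s *)
  else None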
The final example does not show off myawk particularly
well, but it does show what there is to decry about shell
pipes. One can do better in most any other language.
The original shell script is a sample GIT commit hook,
.git/hooks/commit-msg.sample
found in any GIT repository. Its
comments explain:
Called by "git commit" with one argument, the name of the file that has the commit message. The hook should exit with non-zero status after issuing an appropriate message if it wants to stop the commit. The hook is allowed to edit the commit message file.This example catches duplicate Signed-off-by lines.
Here is the shell script itself
test "" = "$(grep '^Signed-off-by: ' "$1" | sort | uniq -c | sed -e '/^[ ]*1[ ]/d')" || { echo >&2 Duplicate Signed-off-by lines. exit 1 }
Looking at the original shell script brings despair. Not only is the
script algorithmically ugly: if a duplicate signed-off line occurs
near the beginning, we should be able to report it right away and
stop. We do not need to read the rest of the commit message, filter
it, sort it, precisely count all duplicates and filter again. Not only
does the script gratuitously waste system resources by launching many
processes and allocating communication
buffers. Mainly, the script is not good for its primary purpose: it
is not easy to write and read. Pipeline composition of small stream
processors is generally a good thing -- but not when each stream
processor is written in its own idiosyncratic language. (For example:
although one may have a general idea about uniq
, what does uniq -c
do? Understanding the sed
phrase requires reading yet another man page.)
Incidentally,
I have doubts about the script: I think that quotes around $1
are meant to be embedded; but why are they not
escaped then? Probably it is some edge case of bash, out of several
thousands.
In contrast, the OCaml script below does exactly what is required, with no extra work -- and can be understood without reading many man pages.
module H = Hashtbl
let commit_msg = Sys.argv.(1)
let ht = H.create 5
let () = for_each_line ~fname:commit_msg @@ fun l ->
  if is_prefix "Signed-off-by: " l <> None then begin
    if H.find_opt ht l <> None then begin
      prerr_endline "Duplicate Signed-off-by lines.";
      exit 1
    end
    else H.add ht l ()
  end