2007-12-05

Functional Refactoring

A common and simple enough task: count the words present in a text file. For our purposes (and as it is generally agreed), a word is a contiguous sequence of alphabetic characters, delimited by non-alphabetic characters. We should be mindful of edge cases: both the beginning and the end of the file are treated as non-alphabetic characters; in other words, the first and last words of the file, if any, must not be lost.

I found lying around in my code folder (you'd be surprised) the following OCaml code:

let iter_words f inch =
  let ibf = String.create buflen
  and obf = Buffer.create 80
  and any = ref true
  and inw = ref false in
  while !any do
    let nread = unsafe_input inch ibf 0 buflen in
    for i = 0 to pred nread do
      let c = String.unsafe_get ibf i in
      if !inw then begin
        if isalpha c then
          Buffer.add_char obf c
        else begin
          f (Buffer.contents obf);
          Buffer.clear obf;
          inw := false
        end
      end else if isalpha c then begin
        Buffer.add_char obf c;
        inw := true
      end
    done;
    any := nread != 0
  done;
  if !inw then f (Buffer.contents obf)

The first thing to notice is that this is an iterator over the file's words: for every one found, the parameter f is called with its text. What f does with the word is its problem; the advantage of this approach is that you can process files without having to load the entire result on memory; the main disadvantage is that you are forced to process words in sequence, and handle the intermediate state by yourself.

Inside the function, the code is unabashedly imperative: a pair of nested loops process the file, the outer reading a bufferful at a time over ibf (buflen is implied, and defined somewhere else with a suitable value; so is unsafe_input), the inner scanning the buffer a character at a time and accumulating eventual words in obf. Inside the inner loop, a state machine uses inw to track whether it is scanning a word or a non-word, changing according to predicate isalpha (also suitably defined). This state must survive entire buffers, as a word might stride one. A further state variable, any, controls the outer loop termination.

All very straightforward (this code has modification date October 1st, 2006, and I could read it after all this time without problem), but also very, very ugly: this style may be appropriate for C, but it is not especially defensible in OCaml. So, I set out first to replace the condition variable by a functional state machine. The idea is to have a number of recursive functions, each for every state, which, instead of transitioning, returns the next function to call. This is very similar to the object-oriented GoF pattern "State Machine".

The only complication is that in order to express this pattern we need recursive types: a state is the type of functions of a parameter of type, say, α returning a state! In other words, the type is α → β as β. For OCaml to accept this, it needs the -rectypes option; the problem with this extension is that the type-checker accepts a lot of junk along with the valid definitions, making development more fragile. The alternative is to wrap the recursion so that it goes through a type constructor: either an algebraic data-type, or a record type. I choose the latter:

type 'a state = { fn : 'a -> 'a state }

The function signature is the same, and I'll keep for now the input and output buffers:

let iter_words f inch =
  let ibf = String.create buflen
  and obf = Buffer.create 80 in

The first state is "scanning outside a word"; if the next character is alphabetic, it must start a new word and transition, otherwise it keeps looking:

  let
  rec no_word = { fn =
    fun c ->
      if isalpha c then begin
        Buffer.add_char obf c;
        in_word
      end else
        no_word
  }

The second state is "scanning inside a word"; if the next character is alphabetic, it accumulates it and keeps scanning, otherwise it ends the word, sends it to the accumulating function f and transitions:

  and in_word = { fn =
    fun c ->
      if isalpha c then begin
        Buffer.add_char obf c;
        in_word
      end else begin
        let () = f (Buffer.contents obf) in
        Buffer.clear obf;
        no_word
      end
  }
  in

A buffer is scanned character by character (I've renamed the variables formerly known as nread and i). It must, however, conserve the last known state, so that words spanning buffers are not incorrectly split:

  let rec scan pos len state =
    if pos == len then state else
    let c = String.unsafe_get ibf pos in
    scan (succ pos) len (state.fn c)
  in

Note how the state handler returns its successor. The file must then be read a buffer at a time. Again, it must conserve the state across calls to the buffer scanning loop, terminating when there is no more data to read:

  let rec ioread state =
    let nread = unsafe_input inch ibf 0 buflen in
    if nread == 0 then state else
    ioread (scan 0 nread state)
  in

Finally, the entire device is set into motion by reading outside a word (remember, the beginning-of-file is taken as an implied non-word character). If the ending state is in_word, the last word must be taken into account and processed (again, the end-of-file is taken as an implied non-word character):

  if ioread no_word == in_word then f (Buffer.contents obf)

From 26 lines to 38 lines, a modest increase in code length for a great improvement in modularity and (I'd argue) understandability: higher-order functions are a powerful abstraction device. I could, however do better: there are many explicitly recursive functions that beg to be generalized as recursion schemes, or patterns.

The first recursion scheme that should be pulled out is the reading loop. It uses as a free variable the input buffer that it shares with the rest of the code. Abstracting out the work done on each buffer, and accumulating on an explicit parameter:

let fold_in f e inch =
  let ibf = String.create buflen in
  let rec fold e =
    let nread = unsafe_input inch ibf 0 buflen in
    if nread == 0 then e else
    fold (f e ibf nread)
  in fold e

The parameter f will take the the last accumulated value e, and the newly read data in the form of the buffer and the count of valid characters in it; it must return the updated result for the next iteration. This means that I must manage to supply scan with the relevant state. But I will also abstract out the iteration over a string. Unfortunately, the String module doesn't define a fold on strings, much less one on a substring:

let fold_string f e str pos len =
  let rec fold pos e =
    if pos == len then e else
    let c = String.unsafe_get str pos in
    fold (succ pos) (f e c)
  in fold pos e

I submit that both functions are independently useful, especially the last one. The input iteration is just:

  let ioread state =
    fold_in scan state inch

In turn, the string iteration (with parameters suitably rearranged to allow for η-conversion) is:

  let scan state buf len =
    fold_string (fun state -> state.fn) state buf 0 len

Now there is not much point in naming the simple folds; η-converting and inlining:

  let ioread state =
    fold_in (fun state buf -> fold_string (fun state -> state.fn) state buf 0)
      state inch

The state machine doesn't change, and the output buffer can still be kept free ("global", so to speak) in it. The entire code is:

let iter_words f inch =
  let obf = Buffer.create 80 in
  let
  rec no_word = { fn =
    fun c ->
      if isalpha c then begin
        Buffer.add_char obf c;
        in_word
      end else
        no_word
  }
  and in_word = { fn =
    fun c ->
      if isalpha c then begin
        Buffer.add_char obf c;
        in_word
      end else begin
        let () = f (Buffer.contents obf) in
        Buffer.clear obf;
        no_word
      end
  }
  in
  let ioread state =
    fold_in (fun state str -> fold_string (fun state -> state.fn) state str 0)
      state inch
  in
  if ioread no_word == in_word then f (Buffer.contents obf)

The code is, now, 44 lines long, counting the abstracted folds. There is nothing, however, that prevents the state to include both the callback function to the overall word iteration and the output buffer. The result is not as pretty, because all those lifted parameters clutter the function signatures a bit; it has, however the advantage that each function is top-level:

let
rec no_word = { fn =
  fun (f, buf, c) ->
    if isalpha c then begin
      Buffer.add_char buf c;
      in_word
    end else
      no_word
}

and in_word = { fn =
  fun (f, buf, c) ->
    if isalpha c then begin
      Buffer.add_char buf c;
      in_word
    end else begin
      let () = f (Buffer.contents buf) in
      Buffer.clear buf;
      no_word
    end
}

let ioread =
  fold_in (fun st str ->
    fold_string (fun (f, buf, state) c -> (f, buf, state.fn (f, buf, c)))
      st str 0)

let iter_words f inch =
  let (_, buf, state) = ioread (f, Buffer.create 80, no_word) inch in
  if state == in_word then f (Buffer.contents buf)

The code is more compact at 31 lines, thanks to reusing the folds. A tweak that could make the code a bit tidier is to change the type of the states to be transducing instead of just accepting. The result of a transducing state includes the information passed to it updated by the state as needed; this means that, whereas the former code worked because the state (as represented by the Buffer) was mutable, the following is general:

type ('a, 'b) trans = { st : 'a -> 'b -> 'a * ('a, 'b) trans }

let
rec no_word = { st =
  fun (f, buf as s) c ->
    if isalpha c then begin
      Buffer.add_char buf c;
      s, in_word
    end else
      s, no_word
}

and in_word = { st =
  fun (f, buf as s) c ->
    if isalpha c then begin
      Buffer.add_char buf c;
      s, in_word
    end else begin
      let () = f (Buffer.contents buf) in
      Buffer.clear buf;
      s, no_word
    end
}

let ioread =
  fold_in (fun e str ->
    fold_string (fun (t, state) -> state.st t)
      e str 0)

let iter_words f inch =
  let buf = Buffer.create 80 in
  let (_, state) = ioread ((f, buf), no_word) inch in
  if state == in_word then f (Buffer.contents buf)

And this is essentially the 34 lines of code that you would write in Haskell, except that the folds would be provided to you in the corresponding monads. The only imperative bits are the file input and the use of an extensible buffer for performance.

Speaking of which, what is the cost of abstracting out the structure? I tested all five versions with the Project Gutember e-text for Beowulf. I ran the tests seven times, took their mean time, and normalized it to that of the first version. First, bytecode:

Abstraction Cost, bytecode (ocamlopt.opt -inline 10 -unsafe -o ioread.exe unix.cmxa ioread.ml)
VersionSpeed (Rel.)Speed-down
1.0000.0 %
1.2721.3 %
1.3827.5 %
1.8144.8 %
2.2455.4 %

Then, naïve native code:

Abstraction Cost, native code, non-inlined (ocamlopt.opt -o ioread.exe unix.cmxa ioread.ml)
VersionSpeed (Rel.)Speed-down
1.0000.0 %
1.2419.1 %
1.3827.8 %
1.7944.3 %
2.3056.4 %

Lastly, optimized native code:

Abstraction Cost, native code, inlined (ocamlopt.opt -inline 10 -unsafe -o ioread.exe unix.cmxa ioread.ml)
VersionSpeed (Rel.)Speed-down
1.0000.0 %
1.2721.3 %
1.3827.5 %
1.7843.9 %
2.2455.4 %

As you can see, the differences are pretty consistent and independent of the target architecture. Furthermore, the cost of a fully functional, monadic read is more than half more that of the imperative double-loop. The good thing about OCaml is not that it is an efficient functional language, but that it lets you move seamlessly between functional and imperative style of coding, according to your preferences and your needs.

No comments: