2009-12-23

Gained in Translation

First of all, I'd like to apologize for the infrequent updates and the lightness of the last few entries. I seldom have time of late for anything but the quickest of finger exercises, but I wanted to put something on writing before the year is over. What better inspiration than one of Remco Niemeijer's terse solutions to the daily Programming Praxis. This week's asks for an implementation of Parnas's permuted indices, and Remco's solution is minimal enough. I translated his Haskell code to almost-verbatim OCaml, interjecting the necessary definitions to make the code read essencially the same way. For example, I needed to translate:

rot xs = [(unwords a, unwords b) | (a, b) <- init $
          zip (inits xs) (tails xs), notElem (head b) stopList]

(n.b: this is Haskell). The function inits returns all the initial segments of a list, so that inits "abc" = ["", "a", "ab", "abc"]. Conversely, tails returns all the tails of a list, so that tails "abc" = ["abc", "bc", "c", ""]. The zip of both lists is the list of all the ways in which you can split a list, so that with our example:

zip (inits "abc") (tails "abc") = [
  ("", "abc"),
  ("a", "bc"),
  ("ab", "c"),
  ("abc", "")
]

and the init of that is every element on that list except for the last one. After writing the necessary infrastructure, the equivalent solution was simple to write, but then I noticed that I could refactor it into something a bit terser. The first opportunity for compression I found was to use an ad-hoc function for splitting a list in every way possible except the last, in effect subsuming init $ zip (inits xs) (tails xs) into a single recursive function:

let rec split_all l = match l with
| []      -> []
| x :: xs -> ([], l) :: List.map (fun (hs, ts) -> x :: hs, ts) (split_all xs)

Classic of text processing tasks in Haskell is the use of functions converting text into lists and vice versa; this required writing some simple helper functions:

let words = Str.split (Str.regexp " ")
and lines = Str.split (Str.regexp "\n")
and unwords = String.concat " "

The permuted index construction must filter a number of stop words:

let stop_list = words "a an and by for if in is of on the to"

As Remco explains, the core function for generating a permuted index finds all the splittings of a given sentence, and uses the head as the context for the tail. The function he gives is a typical generation—filtering—reduction pipeline expressed as a list comprehension. I initally wrote the comprehension as a right fold (this is always possible), and in a second phase I rewrote that into a point-free function more directly expressing the reduction. For that I needed a number of combinators:

let ( % ) f g x = f (g x)

let cross f (x, y) = (f x, f y)

let distrib f (x, y) (z, t) = (f x z, f y t)

let flip f x y = f y x

The composition operator % is an old friend of this blog. The combinator cross lifts a function over a pair, and distrib distributes a binary function over two pairs. The combinator flip swaps the arguments to a curried function. All of this is standard and it allowed me to write:

let rot = List.map (cross unwords) % List.filter (not % flip List.mem stop_list % List.hd % snd) % split_all

This function takes a list of words, splits it every which way, throws away those pairs whose second component (the tail) begins with a stop word, and joins each component of the resulting pairs of words into sentence fragments. The function pretty-printing an index underwent a similar compression: instead of finding the longest fragment separately on each component, I did it in one pass over the list of pairs:

let pp_index xs =
  let l1, l2 = List.fold_right (distrib (max % String.length)) xs (0, 0) in
  List.iter (fun (a, b) -> Printf.printf "%*s   %-*s\n" l1 a l2 b) xs

The function max % String.length composes on the first argument of the curried max, and thus has type string → int → int; in order to distribute over pairs and make the types come out right I needed to use a right_fold instead of a more natural (in OCaml) left_fold, but this is otherwise straightforward. The pretty printing of the index is exactly like in Remco's code, as both OCaml's and Haskell's printf implement the same formats. Putting everything together needs a sort on the second components; I use a (very inefficient) helper function implementing case-insensitive sort:

let ci_compare a b = compare (String.lowercase a) (String.lowercase b)

let permute_index =
  pp_index % List.sort (fun (_, a) (_, b) -> ci_compare a b) % List.concat % List.map (rot % words) % lines

The text is decomposed into lines, each line is further decomposed into words and an index is built for it, the fragmentary indexes are collated and sorted into the final result which is finally printed out. A test gives out the expected result:

let () = permute_index "All's well that ends well.\nNature abhors a vacuum.\nEvery man has a price.\n"
              Nature   abhors a vacuum.
                       All's well that ends well.
     All's well that   ends well.
                       Every man has a price.
           Every man   has a price.
               Every   man has a price.
                       Nature abhors a vacuum.
     Every man has a   price.
          All's well   that ends well.
     Nature abhors a   vacuum.
               All's   well that ends well.
All's well that ends   well.

A point to keep in mind is that permute_index and especially rot would probably have been clearer written in a monadic style, as it emphasizes an "element-at-a-time" view of list processing as I've written before. The downside would have been the need to name every intermediate value being transformed:

let rot xs =
  split_all xs >>= fun (hs, ts) ->
  guard (not (List.mem (List.hd ts) stop_list)) >>
  return (unwords hs, unwords ts)

It seems that, in this sense, monadic beats recursion but point-free beats monadic for conciseness. As it is, the 30 lines comprising this code fit in one short page. Not bad.

No comments: