2011-11-11

Modular Semantics for Brainfuck

The problem with Brainfuck is that its semantics are given by its different implementations but not all its implementations agree so that writing an interpreter is straightforward but writing portable Brainfuck programs is not. OCaml functors allows playing with pluggable semantics in a way that would be very difficult laborious with other languages. I can't imagine this flexibility in object-oriented languages like Java or Python, but in OCaml the mix-and-match approach with module inclusion is straightforward.

Update: OK, so reddit justly challenged my grandstanding. The moral equivalent of this typesafe, modular, extensible interpreter requires more than 520 tight lines, of which 270 go to the parser, pretty-printer and the evaluator (two visitors over 7 case classes), and 250 go to the plug-in semantics. The end result is more flexible than the following OCaml code, since I can choose the composition dynamically, at run-time:

public static void main(String[] args) {
  String hello =
">+++++++++[<++++++++>-]<.>+++++++[<++++>-]<+.+++++++..+++.>>>++++++++[<++++>-]"+
"<.>>>++++++++++[<+++++++++>-]<---.<<<<.+++.------.--------.>>+.";

  Semantics s = new S3_(5000, new S2(new S1_(new S0())));
  Brainfuck bf = new Brainfuck(s);
  bf.evaluate(bf.parse(hello), s.initial());
}

The key is using delegation instead of inheritance to override behavior and spending six hours typing code like a madman.

The language is defined by the effect that its eight instructions have on a memory:

module type SEMANTICS = sig
  type mem
  val mem       : mem
  val forward   : mem -> mem
  val backward  : mem -> mem
  val increment : mem -> mem
  val decrement : mem -> mem
  val output    : mem -> mem
  val input     : mem -> mem
  val repeat    : (mem -> mem) -> (mem -> mem)
end

The initial state of the memory is given by mem. The next six functions correspond directly to the six instructions > < + - . , while the operational semantics of each of the brackets is consolidated into a single small-step function, repeat, so that for a properly balanced Brainfuck program [P] its effect on mem is repeat P mem.

Given a SEMANTICS, the interpretation of a Brainfuck program is straightforward:

module Brainfuck (S : SEMANTICS) = struct

The program is represented by an abstract syntax comprising of a sequence of instructions mapping one-to-one to the surface syntax of the language, except for the pair of brackets that represents a single unit of repetition:

type instr = Fwd | Bak | Inc | Dec | Out | Inp | Rep of program
  and program = instr list

Note: The fact that properly balanced [ and ] denote a repeating computation can be taken as a definition. The translation from concrete syntax to abstract syntax ensures that the mapping is faithful and the operational semantics given for [ and ] as separate instructions corresponds to the one for Rep.

Since the semantics of each instruction is a function from memory to memory, the semantics of a program is the left-to-right composition of the semantics of the individual instructions comprising it:

let rec eval p t = List.fold_left (fun t -> function
  | Fwd   -> S.forward   t
  | Bak   -> S.backward  t
  | Inc   -> S.increment t
  | Dec   -> S.decrement t
  | Out   -> S.output    t
  | Inp   -> S.input     t
  | Rep p -> S.repeat (eval p) t) t p

The translation from concrete to abstract syntax is given by a parser:

let parse str =
    let syntax_err fmt = Printf.kprintf failwith fmt in
    let res = ref []
    and stk = ref [] in
    let emit  i  = res := i :: !res
    and enter () = stk := !res :: !stk; res := []
    and leave () = match !stk with
    | r :: s -> stk := s; res := Rep (List.rev !res) :: r
    | []     -> syntax_err "unmatched `]'"
    in String.iter (function
    | '>' -> emit Fwd
    | '<' -> emit Bak
    | '+' -> emit Inc
    | '-' -> emit Dec
    | '.' -> emit Out
    | ',' -> emit Inp
    | '[' -> enter ()
    | ']' -> leave ()
    |  _  -> ()) str;
    if !stk != [] then syntax_err "unmatched `['" else
    List.rev !res

Only the concrete instructions > < + - . , [ and ] have meaning in a program; all other possible characters are ignored and [ and ] must be properly balanced. An adjoint pretty-printer or formatter translates abstract syntax back to concrete, such that for all abstract programs P, parse (format P) ≡ P:

let format prog =
    let buf = Buffer.create 72
    and len = ref 0 in
    let putc c =
      if !len == 72 || c == '\n' then begin
        Buffer.add_char buf '\n';
        len := 0
      end;
      if c != '\n' then begin
        Buffer.add_char buf c;
        incr len
      end
    in
    let rec go prog = List.iter (function
    | Fwd   -> putc '>'
    | Bak   -> putc '<'
    | Inc   -> putc '+'
    | Dec   -> putc '-'
    | Out   -> putc '.'
    | Inp   -> putc ','
    | Rep p -> putc '['; go p; putc ']') prog
    in go prog; putc '\n'; Buffer.contents buf
end

(The proof of this last property would go by mutual induction on format and parse if it were not that they are imperative; the conversion to pure recursive form is straightforward and mechanical but left as an exercise to the reader.) It is time to give concrete semantics for each of the instructions.

The simplest possible store is a tape of memory cells which can be conveniently represented by a zipper:

type tape = Tape of int list * int * int list

The initial memory has an infinite number of zeros:

module S0 = struct
  type mem = tape

  let mem =
    let rec zeros = 0 :: zeros in
    Tape (zeros, 0, zeros)

Advancing and retreating the cell pointer simply moves the zipper, since both ends are infinite:

let forward  (Tape (ls, c, r :: rs)) = Tape (c :: ls, r, rs)
  let backward (Tape (l :: ls, c, rs)) = Tape (ls, l, c :: rs)

(The cyclic lists at both ends can be considered as sentinel nodes.) The simplest representation for a cell is as the underlying machine integer:

let increment (Mem (ls, c, rs)) = Mem (ls, c + 1, rs)
  let decrement (Mem (ls, c, rs)) = Mem (ls, c - 1, rs)

Character output is consistently among implementations just the output to console of the cell value as an ASCII code:

let output (Mem (_, c, _) as t) =
    output_char stdout (char_of_int (c land 255));
    t

Character input differs on the treatment of end-of-file; probably the most general behavior is to not modify the cell on EOF:

let input (Mem (ls, c, rs) as t) =
    let c = try int_of_char (input_char stdin) with End_of_file -> c in
    Tape (ls, c, rs)

Repetition is the effect of p whenever the current cell is not zero:

let rec repeat p t = match t with
  | Mem (_, 0, _) -> t
  | _             -> repeat p (p t)
end

One variation between interpreters is the EOF behavior. Some of them set the current cell to 0 on EOF:

module S1 (S : SEMANTICS with type mem = tape) = struct
  include S

  let input (Tape (ls, c, rs)) =
    let c = try int_of_char (input_char stdin) with End_of_file -> 0 in
    Tape (ls, c, rs)
end

Others set it to -1:

module S1' (S : SEMANTICS with type mem = tape) = struct
  include S

  let input (Tape (ls, c, rs)) =
    let c = try int_of_char (input_char stdin) with End_of_file -> -1 in
    Tape (ls, c, rs)
end

Another variation between interpreters is the cell width in bytes. The standard reference interpreter uses eight-bit cells:

module S2 (S : SEMANTICS with type mem = tape) = struct
  include S

  let increment (Tape (ls, c, rs)) = Tape (ls, (c + 1) land 255, rs)
  let decrement (Tape (ls, c, rs)) = Tape (ls, (c - 1) land 255, rs)
end

(To avoid redefining all functions I simply mask off excess bits from the native integer representation.) Yet another possible variation is to have a fixed-size tape. What to do when trying to move the cell pointer past the ends is a free choice; I could opt for simply ignoring the attempted movement:

let fill v =
  let rec go l n =
    if n == 0 then l else
    go (v :: l) (pred n)
  in go []

module S3 (N : sig val size : int end)
          (S : SEMANTICS with type mem = tape) = struct
  include S

  let mem = Tape ([], 0, fill 0 N.size)

  let forward  t = match t with
  | Tape (ls, c, r :: rs) -> Tape (c :: ls, r, rs)
  | Tape (_ , _, []     ) -> t

  let backward t = match t with
  | Tape (l :: ls, c, rs) -> Tape (ls, l, c :: rs)
  | Tape ([]     , _, _ ) -> t
end

The other alternative is to wrap the cell pointer at either end:

module S3' (N : sig val size : int end)
           (S : SEMANTICS with type mem = tape) = struct
  include S3 (N) (S)

  let forward  t = match t with
  | Tape (ls, c, r :: rs) -> Tape (c :: ls, r, rs)
  | Tape (ls, c, []     ) ->
    let r :: rs = List.rev (c :: ls) in
    Tape ([], r, rs)

  let backward t = match t with
  | Tape (l :: ls, c, rs) -> Tape (ls, l, c :: rs)
  | Tape ([]     , c, rs) ->
    let l :: ls = List.rev (c :: rs) in
    Tape (ls, l, [])
end

Now the choice of semantics is a matter of stacking the desired functors:

let hello = "
>+++++++++[<++++++++>-]<.>+++++++[<++++>-]<+.+++++++..+++.>>>++++++++[<++++>-]
<.>>>++++++++++[<+++++++++>-]<---.<<<<.+++.------.--------.>>+."

let _ =
  let module S  = S3'(struct let size = 5000 end)(S2(S1(S0))) in
  let module BF = Brainfuck(S) in
  BF.(eval (parse hello)) S.mem

This makes BF a Brainfuck interpreter running on a bounded memory of 5.000 byte-sized cells with 0 on EOF.

1 comment:

startdev.mramirez said...

Cool. I have read about Brainfuck as a working joke.

But, modules are something that todays "hip" programming languages}, seems not to consider, or not enforce (like PHP, C, C++, where modules weren't consider, until later).