Lecture 2 - Demo - While language interpreter

The goal of this demo is to show how to build an interpreter for a simple language called While.
First we describe briefly the language, then explain how the interpreter is implemented in OCaml.

While language, informally

A program in a While language consists of stmt that can be For instance, a While program computing a factorial of 6 is written as follows:
            i = 1
            r = 1
            while i <= 6:
              r = i * r
              i += 1
            print("Computing factorial of 6 =", r)
        
Note that indentation is significant in case of conditionals and while loop (just like in Python).
Note also that expressions are simply either

The interpreter development

To build the interpreter for While language, a set of OCaml files can be downloaded here: while.gz)
together with Makefile and dune files to compile and run the interpreter and the test file.

Once uncompressed (with tar zxvf while.tar.gz), you get a directory while/ with the following files:

ast.ml abstract syntax of While language (focus of this demo)
interp.ml interpreter (focus of this demo)
while.ml main file (produced the interpreter that can be run on the test file)
lexer.mll lexical analysis (used by the main file to analize the source text of the input)
parser.mly parsing (used by the main file to construct the AST of the input)
Makefile/dune to automate the build
test.wh to test the interpreter (input file for this demo)

The project compiles (run make, that launches dune build).

The executable takes a While file on the command line, with suffix .wh. When it is absent, file test.wh is used. When running make, the interpreter is run on file test.wh.

Abstract Syntax (ast.ml)

Parse trees are the output of the parser and the input of the interpreter.

Identifiers

Identifiers (variables) are normally represented by a value of type string. However, to signal better the syntax errors
(e.g. when an unknown identifier is used in the program), we wrap identifier's type into the following record type ident,
using additional field loc that tracks the position (line & column) for each identifier:

    type location = Lexing.position * Lexing.position
    type ident = { loc: location; id: string; }

Expressions

We define the grammar of the expressions as follows:


    (* Unary operators. *)
    type unop =
      | Uneg (* -e *)
      | Unot (* not e *)

    (* Binary operators. *)
    type binop =
      | Badd | Bsub | Bmul | Bdiv | Bmod    (* + - * // % *)
      | Beq | Bneq | Blt | Ble | Bgt | Bge  (* == != < <= > >= *)
      | Band | Bor                          (* and or *)

    (* Constants. *)
    type constant =
      | Cbool of bool
      | Cstring of string
      | Cint of int

    (* Expressions. *)
    type expr =
      | Ecst of constant                   (* constant *)
      | Eunop of unop * expr               (* unary operation *)
      | Ebinop of binop * expr * expr      (* binary operation *)
      | Eident of ident                    (* variable *)

Note how the expr type is defined recursively.

Statements

Finally, we define the abstract syntax for the statements and the program:

   (* Statements. *)
    type stmt =
      | Sif of expr * stmt * stmt       (* conditional *)
      | Sassign of ident * expr         (* modifying a variable *)
      | Sblock of stmt list             (* a sequence of statements *)
      | Sprint of expr list             (* printing a list of expressions *)
      | Swhile of expr * stmt           (* while loop *)

    (* a program is simply a statement. *)
    type file = stmt
    
Note the type list in the constructors Sblock and Sprint. Before looking at how the interpreter is implemented,
let us take a closer look on the type of lists in OCaml.

Lists (OCaml, general knowledge)

(The text below is translated from French from these notes written by Jean-Christophe Filliâtre.)


In OCaml, there is a predefined type of lists. These lists are immutable (not modifiable in place) and homogeneous (all elements of a list are of the same type). If 'a denotes the type of elements of a list (we call 'a a type variable), then the list has the type 'a list.

Lists are constructed from the empty list [] and by adding an element at the head of a list with :: which is an infix operator. Thus, the list containing the integers 1, 2, and 3 can be defined as follows:
  # let l = 1 :: 2 :: 3 :: [];;
  val l : int list = [1; 2; 3]
As seen in this display, OCaml offers a more concise syntax to directly construct a list by extension, by specifying all its elements:
  # let l = [1; 2; 3];;
  val l : int list = [1; 2; 3]
Many functions on lists are predefined: to access the first element, all the others, calculate the length, etc.
See the List module from the standard library for more details.

The power of OCaml's lists comes from the possibility of construction by case on the form of a list, called pattern matching. A list is indeed either empty or formed by a first element and another list, and the match with construction allows reasoning by cases according to the form of a list. One can write a function calculating the sum of the elements of a list of integers like this:
    # let rec sum l =
        match l with
        | []     -> 0
        | x :: r -> x + sum r;;
    val sum : int list -> int = 
  
The match with construction consists of an expression to consider (between the keywords match and with, here the list l) and one or more filtering cases separated by a vertical bar |. A filtering case consists of a pattern and an expression separated by an arrow. The pattern is a constructor (here [] or ::) and its arguments can be named (here the two arguments of :: are named x and r).

The semantics is intuitive: the examined expression is compared against the different patterns (in their order of appearance) and when there is a match for a pattern, the expression associated with that pattern is evaluated, in an environment where the variables of the pattern are associated with the corresponding values in the filtered expression.

For example, if we apply the sum function to the list 1::2::3::[], the filtering case that applies is the second one, x taking the value 1 and r taking the value 2::3::[]. We then evaluate the expression x + sum r in this environment. This filtering case will apply two more times, before we finally reach the list [] for which the first case will apply, returning the value 0. We will then have the expected result:
    # sum [1;2;3];;
    - : int = 6
  
It becomes clear now where the power of pattern matching comes from: it acts like a series of tests and local variable definitions at the same time, all with an extremely concise syntax. There's even a syntactic shortcut for pattern matching on the last argument of a function, using the function keyword. Thus, we can rewrite the sum function as simply as:
    let rec sum = function
      | []     -> 0
      | x :: r -> x + sum r
  

Interpreter (interp.ml)

We start by opening Ast and Format modules to get access to declarations from abstract syntax of While language and from Format module providing pretty-printing facilities respectively. As the name 'module' suggests, a module is primarily a way to introduce modularity into a program, i.e., to divide it into units.

    open Ast
    open Format
 

Utilities

We now declare a new exception Error to signal errors during program evaluation by our interpreter (The of string means that the exception can carry a meaningful message.
For instance, if we try to evaluate 1 / True in While language this will be a typing error (caught dynamically, because our While language does not have a type systems).

    exception Error of string
    let error s = raise (Error s)
The exception is thus simply raised by calling the function error, e.g. error "Type mismatch". Such an exception can also be caught later in the program, using try with construction whose usage is exactly the same as with match with pattern matching.

Next, we need a (semantic) notion of values, i.e. to what the interpreter evaluates the expressions to:



    (* Values. *)
    type value =
      | Vbool of bool
      | Vint of int
      | Vstring of string

Let us define pretty-printing functions for printing a value, and printing a list of values (which we will use for interpreting e.g. print(42+5, "hello")) command.

Note how print_values is defined recursively below with three cases (empty list [], singleton list [v], and a list v :: vtl whose head is a value v and tail is some list of values vtl).


    (* Print a value on standard output *)
    let print_value = function
      | Vbool true -> printf "True"
      | Vbool false -> printf "False"
      | Vint n -> printf "%d" n
      | Vstring s -> printf "%s" s

    (* Printing a list of values. *)
    let rec print_values vl = match vl with
      | [] -> printf "@."
      | [v] ->
        print_value v;
        printf "@."
      | v :: vtl ->
        print_value v;
        print_string " ";
        print_values vtl

        
Finally, values are also what is stored in the declared variables of a program. Therefore, we need to have a data structure representing such a "memory" and which allows us to add a new variable with the associated value to the memory, and to update or access the content of variables. Here we represent such a memory with type
    type memory = (string, value) Hashtbl.t
This type is passed to the OCaml functions of the interpreter below as parameter `ctx` of type memory.

Interpreting expressions

To interpret expressions, we define the following mutually recursive definitions (in OCaml recursive function is declared using let rec ... = ... construction and the subsequent mutually defined recursive functions are declared using and ... = ... syntax). In the definition of interp_expr below, the function takes as argument the memory ctx and the expression which it pattern matches with all possible cases (constant, unary expression, binary expression, and a variable):

let rec interp_expr ctx = function
  | Ecst c -> interp_const c
  | Eunop (op, e1) -> interp_unop ctx op e1
  | Ebinop (op, e1, e2) -> interp_binop ctx op e1 e2
  | Eident {id} -> try Hashtbl.find ctx id  with _ -> error "not found"
In the first three cases the corresponding interpretation function is called using variables c, op, e1, e2 introduced in the patterns.
In the case of the variable Eident {id} -> ... where is only look at the identifier field, we use the memory ctx to try to find the associated value try Hashtbl.find ctx id with _ -> error "not found" and if the variable with identifier id is not present in the memory, we catch the exception raised by the call of the library function Hashtbl.find and raise our own defined exception Error s, calling error "not_found"

Interpreting constants is trivial in our case (in other languages it might be less trivial):

and interp_const = function
  | Cbool b ->  Vbool b
  | Cstring s -> Vstring s
  | Cint n -> Vint n
For unary operations, we first evaluate the expression e1 making a recursive call inter_expr ctx e1 and then proceed by case analysis on what is the unary operator op.
Notice the errors for the wrong operand type in each case (unary minus and unary logical negation).

(* Interpreting unary operations. *)
and interp_unop ctx op e1 =
  let v1 = interp_expr ctx e1 in
  match op with
  | Uneg ->
    let v1 = interp_expr ctx e1 in
    begin match v1 with
      | Vint n1 -> Vint (-n1)
      | _ -> error "wring unary operand type: argument must be of integer type!"
    end
  | Unot ->
    begin match v1 with
      |  Vbool b1 -> Vbool (not b1)
      | _ -> error "wring unary operand type: argument must be of Boolean type!"
    end
For binary operations, we first make a case analysis whether the binary operator is used to return an integer value or a Boolean one,
dispatching each case to a dedicated interpreter function, merely passing the memory ctx the arguments , op, e1, e2.

and interp_binop ctx op e1 e2 =
  match op with
  | Badd | Bsub | Bmul | Bdiv | Bmod -> interp_binop_arith ctx op e1 e2
  | _ (* all other cases *) ->  interp_binop_bool ctx op e1 e2

and interp_binop_arith ctx op e1 e2 =
  let v1 = interp_expr ctx e1 in
  let v2 = interp_expr ctx e2 in
  match v1, v2 with
  | Vint n1, Vint n2 ->
    begin match op with
        | Badd -> Vint (n1 + n2)
        | Bsub -> Vint (n1 - n2)
        | Bmul -> Vint (n1 * n2)
        | Bmod -> Vint (n1 mod n2)
        | Bdiv -> if n2 = 0 then error "division by zero!" else Vint (n1 / n2)
        | _ -> assert false (* other operations excluded by asssumption. *)
    end
  | _ -> error "wring operand type: arguments must be of integer type!"
Note the pattern matching | _ -> assert false which excludes the cases where op is e.g. Band.
Indeed, for the arithmetic binary operations, we can safely assume that op can only be one of the binary
operations Badd | Bsub | Bmul | Bdiv | Bmod, since this is the context in which the interp_binop_arith is called from the function interp_binop.
The same goes for the implementation of the interp_binop_bool function (where we evaluate logical AND and OR cases efficiently):

and interp_binop_bool ctx op e1 e2 =
  (* We first treat cases where `op` is a logical operation `Band` or `Bor`
     separately for efficiency. *)
  if op = Band then
    begin match interp_expr ctx e1 with
      | Vbool b1 ->
        if b1 then begin match interp_expr ctx e2 with
          | Vbool b2 -> Vbool b2
          |  _ -> error "unsupported operand types"
        end
        else Vbool false
      | _ -> error "unsupported operand types"
    end
  else if op = Bor then
    match interp_expr ctx e1 with
    | Vbool b1 ->
      if b1 then Vbool true
      else begin match interp_expr ctx e1 with
        | Vbool b2 -> Vbool b2
        | _ ->  error "unsupported operand types"
      end
    | _ ->  error "unsupported operand types"
  else
    (* In all other binary comparison operations, we can evaluate both
       arguments first: *)
    let v1 = interp_expr ctx e1 in
    let v2 = interp_expr ctx e2 in
  match op with
    | Beq  -> Vbool (v1 = v2)
    | Bneq ->  Vbool (not (v1 = v2))
    | Blt  ->  Vbool (v1 < v2)
    | Ble  ->  Vbool (v1 <= v2)
    | Bgt  ->  Vbool (v1 > v2)
    | Bge  ->  Vbool (v1 >= v2)
    | _    -> assert false  (* other operations excluded by asssumption. *)
 

Interpreting statements and the program

Finally, we can now evaluate the statements and the program. Again, we proceed by pattern matching of the kind of statement (conditional, assignment, ...)
using recursion:


    (* Interpreting a statement *)
    let rec interp_stmt ctx = function
      | Sif (e, s1, s2) ->
        begin match (interp_expr ctx e) with
          | Vbool b1 -> if b1 then interp_stmt ctx s1 else interp_stmt ctx s2
          | _ -> error "wrong type : bool expected"
        end
      | Sassign ({id}, e1) -> Hashtbl.replace ctx id (interp_expr ctx e1)
      | Sblock bl ->  block ctx bl
      | Sprint el -> print_values (List.map (fun e -> interp_expr ctx e) el)
      | Swhile (e, s) ->
        match interp_expr ctx e with
        | Vbool b -> if b then (interp_stmt ctx s; interp_stmt ctx (Swhile (e, s))) else printf ""
        | _ -> error "wrong type : bool expected"

    and block ctx = function
      | [] -> ()
      | s :: sl -> interp_stmt ctx s; block ctx sl


    let file s = interp_stmt (Hashtbl.create 16) s

    
Let us consider each case separately.

Conditional

For conditional,
    | Sif (e, s1, s2) ->
        begin match (interp_expr ctx e) with
           | Vbool b1 -> if b1 then interp_stmt ctx s1 else interp_stmt ctx s2
           | _ -> error "wrong type : bool expected"
         end
we interpret the expression to a Boolean value and then evaluate the corresponding branche of the conditional recursively calling interp_stmt.

Assignment

For the assignment,
    | Sassign ({id}, e1) -> Hashtbl.replace ctx id (interp_expr ctx e1)
we use the Hashtbl function replace that stores the result of the evaluation of expression interp_expr ctx e1 in the memory ctx at the variable id
(if the variable id is not in the memory, it will be stored as a new variable, and if it is the memory already, its content is updated).

Block

For the block (i.e. a sequence) of statements,
    | Sblock bl ->  block ctx bl

we using mutual recursion, calling the block function. If the sequence is empty, it returns the unit value ().
Otherwise, we interpret the first statement recursively calling interp_stmt and the proceeding back to evaluate the rest of the block.
Note that thought each statement of the block is evaluated to unit (), the evaluation might affect the (global) memory ctx.

Printing

For printing,
    | Sprint el -> print_values (List.map (fun e -> interp_expr ctx e) el)

we use the defined Utility function print_values. However, that function is defined on the list of values, but here we have a list el of expressions!
We thus first need to turn the list of expressions el into a list of values, before calling the pretty-printing.
And here is where the power of functional programming can be used, calling a higher-order function List.map
which takes a function (here fun e -> interp_expr ctx e) and a list el and then applies that function to each element e of the list.
That is, List.map whose type is ('a -> 'b) -> 'a list -> 'b list is defined as
    let rec map f = function
      | []     -> []
      | x :: l -> (f x) :: map f l
 
and it is higher-order function, because its parameter f is a function itself.

While Loop

Finally for the while loop,
    | Swhile (e, s) ->
      match interp_expr ctx e with
        | Vbool b -> if b then (interp_stmt ctx s; interp_stmt ctx (Swhile (e, s))) else printf ""
        | _ -> error "wrong type : bool expected"
we first interpret the guard expression e. If the result of the evaluation is true (Vbool b, with b equal true),
then we interpret recursively a statment of the loop body s by calling interp_stmt ctx s and then interpret the loop again interp_stmt ctx (Swhile (e, s)).
Note that here we used OCaml operator e1 ; e2 which is nothing but a notation for let () = e1 in e2, i.e. notation for the case where the result of evaluation of OCaml expression e1 is of type unit ().
Alternatively, we could write instead interp_stmt ctx (Sblock s :: (Swhile (e, s))) using the notion of block of the While language.



back to the main page