Tuesday, January 1, 2013

Code Generators, Rewrite Rules, Aliasing and the Coq

DDC 0.3.1 was pushed onto Hackage just before Christmas.

The main features in this new release are:
  • Back end code generation via C and LLVM that accepts the new core language. There is enough implemented to compile first order programs, but we'll need to implement a lambda lifter for the new core language before we can compile higher order programs directly.
  • Lots more program transformations. You can apply program transformations to core programs on the command line, and check that the code is being optimised the way it should be. There is a tutorial describing how to do this. 
  • A rewrite rule framework that understands the region and effect system. Rewrite rules work on effectful expressions, and can be given predicates so they only apply when particular effects are non-interfering. This lets us perform build-fold fusion when the builder and folder have (non-interferring) side effects. (more info in Amos Robinson's honours thesis)
  • Propagation of no-aliasing meta-data to the LLVM code generator. We have extended the core language with witnesses of distinctness, that prove particular regions of the store are distinct (do not alias). When LLVM has this meta-data it can perform more optimisations on the generated code that it would otherwise be able to. (more info in Tran Ma's honours thesis)
The main missing feature is:
  • No type inference, so the core programs contain lots of type annotations. We'll need to get the type inferencer back online before the language will be useful to end-users.
Our immediate plans are to get a paper together about the new core language and how it supports the rewrite rules and LLVM meta-data. There are interesting stories to tell about all of these things. After that, we'll work on the type inferencer.

For the paper, we'll need a proof that the core language does what it's supposed to. To get this, I've started adding the region and effect system to the Coq formalisation of SystemF2+MutableStore I did about a year ago. I hadn't touched Coq since then, so eased my way back into it by cleaning up the proofs I had already done, then packaged them up into the Iron Lambda collection.

Iron Lambda is a collection of Coq formalisations for functional core languages of increasing complexity, with the SystemF2+Regions+Effects language I'm working on now being the latest one. I made a wiki page for Iron Lambda because I think it will be useful in its own right, to help intermediate Coq users bridge the gap between the end of the Software Foundations course and what appears in current research papers. The biggest language in Software Foundations is Simply Typed Lambda Calculus (STLC)+Records+SubTyping, but it doesn't include polymorphism or indirect mutual recursion. To deal with SystemF style polymorphism we need a better approach to names than unique identifiers (ie, something like deBruijn indices). To deal with indirectly mutually recursive language definitions we need to know how to define custom own induction schemes. Iron Lambda shows how to do these things.

For the impatient:

DDC 0.3.1 
 $ cabal update
 $ cabal install ddc-tools

Iron Lambda
 $ darcs get http://code.ouroborus.net/iron/iron-head/

Thursday, February 2, 2012

Vectorisation without Replication in Data Parallel Haskell

Here is a Barnes-Hut gravitation simulation written using Data.Vector and Gloss.



If done naively, such an n-body simulation has a runtime complexity of O(n^2) because we need to consider the interaction of every body with every other body. Barnes-Hut performs an approximation that reduces this to O(n . log n). At each step in the simulation we insert the bodies (grey) into a quad-tree (green), and compute the centroid for each branch (blue). Now, if some other body is sufficiently far away from a centroid, then we use the centroid to approximate the force due to the bodies in the corresponding branch, instead of inspecting each one individually.

Now you've seen the video, the following graph sums up my work on Data Parallel Haskell (DPH) for the past six months:

This is a plot of runtime vs the number of bodies in the Barnes-Hut simulation. The simulation in the video uses just 1000 bodies, but my understanding is that the physicists need millions or billions to do useful physics work. Back to the graph, the red line is the program from the video, which uses Data.Vector to store its arrays and runs sequentially. The brown line is using the DPH library that shipped with GHC 7.2. The blue line is using the DPH library that will ship with GHC 7.4. Note how the asymptotic complexity of the program with New DPH is the same as with Data.Vector.

In Old DPH we were using a naive approach to vectorisation that resulted in the quad-tree being replicated (copied) once for every body in the simulation (bad!). In New DPH we compute the force on each body in parallel while sharing the same quad-tree. There is a story about this, and you can read all about it when we finish the paper.

Of course, asymptotic complexity isn't everything. The DPH program is still running about 10x slower than the Data.Vector program for large input sizes, and I'm sure that production C programs would run at least 5x faster than that. There's much low hanging fruit here. DPH misses many standard optimisation opportunities, which results in numerical values being unboxed and reboxed in our inner loops. There are also algorithmic improvements to the library that are just waiting for me to implement them. If I can make the blue line cross the red line in the next six months, then I'm buying everyone a beer.

Monday, January 23, 2012

The new Disciple Core language

It's been a while since updates. The last one was in August, and after that I got distracted with ICFP. When I got back I spent more time doing Coq proofs, finishing with Progress and Preservation for a version of SystemF2 with mutable algebraic data. The proofs are here. All data is allocated into the store (heap), and there are primitive functions to update it. In this language you can, say, destructively update a list so that the tail points to different physical data. This language doesn't have effect typing, but after this proof I felt like I had crossed the line from "Coq-shy" to "Coq-sure".

In early November I read about ciu-equivalence, and how to prove contextual equivalence of program transformations. Thinking ahead, it felt like time to cleanup the sponginess in the existing DDC core language, because I'd need to do this before trying to formalise it. Although the plain calculus has a proper semantics and a hand-done soundness proof, the actual core language as used in the compiler also has some half-baked hacks. The reviewers of a previous paper had made suggestions about how to cleanup other aspects of the core calculus, and having Coqified all those languages, I now understand why it was good advice.

Cutting to the chase, I've redesigned the DDC core language and there is an interpreter that works well enough to evaluate simple recursive functions. Here is an example:
  letrec {
   ack    [r1 r2 r3: %] 
          (m : Int r1) {!0 | Use r1 + Use r2 + Use r3}
          (n : Int r2) { Read r1 + Read r2 + Alloc r3
                       | Use r1  + Use r2  + Use r3}
          : Int r3
    = letregion r4 in
      let zero = 0 [r4] () in
      let one  = 1 [r4] () in
      case eqInt [:r1 r4 r4:] m zero of {
          1 -> addInt [:r2 r4 r3:] n one;
          _ -> case eqInt [:r2 r4 r4:] n zero of {
                  1 -> ack [:r4 r4 r3:] 
                           (subInt [:r1 r4 r4:] m one) 
                           (1 [r4] ());

                  _ -> ack [:r4 r4 r3:] 
                           (subInt [:r1 r4 r4:] m one)
                           (ack [:r1 r4 r4:] m (subInt [:r2 r4 r4:] n one));
          }
    }
  } in ack [:R0# R0# R0#:] (2 [R0#] ()) (3 [R0#] ());;
Here is another example that does destructive update of an integer:
  letregion r1 with {w1 : Mutable r1} in
  let x : Int r1 = 0 [r1] () in
  let _ : Unit   = updateInt [:r1 r1:] < w1 > x (2 [r1] ()) in
  addInt [:r1 r1 R2#:] x (3 [r1] ());;
and one that suspends the allocation of an integer with lazy evaluation:
  letregion r1 with { w1 : Const r1; w2 : Lazy r1; w3 : Global r1 } in
  let x : Int r1 lazy < w2 > 
     = purify < alloc [r1] w1 >  in
       forget < use [r1] w3 w1 > in
       2 [r1] () 
  in addInt [:r1 R0# R0#:] x (3 [R0#] ());;

Note that this is an intermediate representation for a compiler, and has vastly more type information than a real user would want to write. The compiler will perform type inference on the source language, and automatically translate the user program to this lower level language.

The new DDC core language is described on the wiki and so far I've been reasonably good at keeping the wiki up to date with what's implemented.

The main changes are:
  • Witnesses now exist in their own universe, at level 0 next to values. Both values and witnesses are classified by types, and types classified by kinds. This removes the dependent-kinding of the old language. The more I thought about it, the less fun formalising a dependently kinded language seemed to be.

  • I've removed the more-than constraints on effect and closure variables. The type inference algorithm I am using returns constraints on effect and closure variables, so I had added similar constraints to the core language. This was a bad idea because managing the additional subsumption judgements during core type checking is a headache. The new core language has no such constraints, and I think I know to process the output of the inferencer so I won't need them.

  • Introducing lazy evaluation is now done with the let .. lazy binding form instead of a primitive suspend operator. This goes hand-in-hand with the purify and forget keywords that mask the effect and closure of their body respectively. These two keywords are akin to the type casts used by SystemFc and the GHC core language to support GADTs. I think it makes more sense to mask out the effect of an expression before suspending it, than to pass a witness as to why it's pure. The former style will still be used in the source program because that's how the type inferencer works, but the latter should be easier to work with in the core language.
Anyway, if you darcs get the DDC repo you can make bin/ddci-core to build the interpreter. Run the examples like bin/ddci-core test/ddci-core/30-Eval/30-Letrec/Test.dcx. It's not completely finished, but all the examples under the test/ddci-core directory run fine.

The interpreter should be done in another couple of weeks. After that it'll be time to split off the LLVM backend from the existing compiler so that we can compile core programs directly.

Saturday, August 6, 2011

How I learned to stop worrying and love de Bruijn indices

When I started learning how to formalise languages in Coq, one of my initial stumbling blocks was choosing an approach to name binding. There are at least three fancy sounding contenders: Locally Nameless, Nominal Reasoning and Parametric Higher Order Abstract Syntax. Of course, you could also use good old de Bruijn indices, but proponents of fancy approaches warn about the large volume of lemmas regarding lifting and substitution you'll need to prove before you can start the "real work".

Back in April I scratched around for about two weeks reading up on all the different approaches, trying to identify the "best" one (whatever that means). Then I read the following comment by Xavier Leroy on the POPLmark website. He had used the Locally Nameless approach, but had stated: "If the purpose is just to solve the challenge and be done with it, I would rather go for pure de Bruijn". That was enough to convince me to stop worrying and just get into it. I've spent about 250 hours now doing my own proofs, and I can report that using de Bruijn isn't really that bad. Yes, you need to prove some lemmas about lifting and substitution, but my entire response to naysayers is summed up henceforth: "The only computational operation that lambda calculus has is substitution, so don't pretend you can get away without proving something about it." Corollary: HOAS doesn't count.

The Lemmas

You need to know what the lemmas are though, and I advise cribbing them from someone else's proof. I learned them from Arthur Charguéraud's proof, but I don't know where they are from originally. You're more than welcome to copy them from mine.

Once you know what the lemmas are, and how they work in STLC, extending them to new languages is straightforward. For reference, the following are the definitions of the lifting and substitution operators for STLC. Note that if you just want to prove Progress and Preservation for STLC, you don't actually need the lemmas. They're needed in bigger languages such as SystemF2 to show that type substitution is commutative. Nevertheless, for illustrative purposes we'll just stick with STLC expressions.

Inductive exp : Type :=
 | XVar  : nat -> exp
 | XLam  : ty  -> exp -> exp
 | XApp  : exp -> exp -> exp.

Fixpoint 
 liftX  (d:  nat) (* current binding depth in expression *)
        (xx: exp) (* expression containing references to lift *)
        : exp
 := match xx with 
    |  XVar ix    
    => if le_gt_dec d ix
        (* Index is referencing the env, so lift it across the new elem *)
        then XVar (S ix)

        (* Index is locally bound in the expression itself, and 
           not the environment, so we don't need to change it. *)
        else xx

    (* Increase the current depth as we move across a lambda. *)
    |  XLam t1 x1
    => XLam t1 (liftX (S d) x1)

    |  XApp x1 x2
    => XApp   (liftX d x1) (liftX d x2)
    end.

Fixpoint
 substX (d:  nat) (* current binding depth in expression *)
        (u:  exp) (* new expression to substitute *)
        (xx: exp) (* expression to substitute into *)
        : exp 
 := match xx with
    | XVar ix 
    => match nat_compare ix d with
       (* Index matches the one we are substituting for. *)
       | Eq  => u
       
       (* Index was free in the original expression.
          As we've removed the outermost binder, also decrease this
          index by one. *)
       | Gt  => XVar (ix - 1)

       (* Index was bound in the original expression. *)
       | Lt  => XVar ix
       end

    (* Increase the depth as we move across a lambda. *)
    |  XLam t1 x2
    => XLam t1 (substX (S d) (liftX 0 u) x2)

    |  XApp x1 x2 
    => XApp (substX d u x1) (substX d u x2)
 end. 

Drawing the deBruijn environment

Here is a nice lemma to get started.
Lemma lift_lift
 :  forall n m t
 ,  lift n              (lift (n + m) t) 
 =  lift (1 + (n + m)) (lift n t).
This is one of the lemmas you need for proving commutativity of substitution. Although it looks complicated at first glance, it will make significantly more sense when you know how to draw it.

First, consider the standard typing judgement form:
Env |- Exp :: Type
Assume there are de Bruijn indices in the expression. Some of these indices point to lambdas in the expression itself, corresponding to locally bound variables, while some point to lambdas in the environment, corresponding to free variables. Here is an example:



Suppose I apply (lift 2) to the above expression. Although it would be helpful for educational purposes to do this directly using the definition of 'lift' above (try it!), I'll tell you the shortcut instead. Firstly, if we assume that the definition of 'lift' is correct, none of the bound indices will be changed during the application. Ignoring bound indices, the only ones remaining are free indices. These are the indices that point into the environment.

Now, although applying 'lift' to an expression may increment these free indices, instead of thinking about indices being incremented, it's easier to think about environment elements being shifted. I'll explain what I mean by that in a moment, but for now let's just do it...

Here is the same expression as before, with the environment positions numbered in blue:


Shifting the elements after position two yields the following:



In the above diagram I've used '_' to represent a hole in the environment. This is a place where I could insert a new element, and be guaranteed that none of the indices in the expression point to this new element. This in turn means that if I apply (lift 2) to the expression, then insert a new element into the hole, then the resulting judgement is guaranteed to give me the same type as before (t5 in this case).

Of course, applying 'lift' to the expression doesn't actually change the environment. In this example I've moved 't1' and 't2' to the left for illustrative purposes, but this is an operation separate from applying 'lift' to the expression itself. However, applying 'lift' does changes the indices, and these indices are pointers into the environment. If we like, we could instead think about moving the pointers (the arrow heads) instead of the actual environment elements. This is an important change in perspective. Instead of thinking about 'lift' as incrementing indices in the expression, we want to think about 'lift' as shifting environment pointers to the left.


Let's look at our lemma again:
Lemma lift_lift
 :  forall n m t
 ,  lift m              (lift (n + m) t) 
 =  lift (1 + (n + m)) (lift n t).

As 't' is quantified, we need to prove this for all possible terms 't'. Using our change in perspective, this also means that we want to prove the lemma for all possible sets of environment pointers. Given the environment pointers for 't', the expression '(lift (n + m) t)' transforms them into a new set (by adding a hole at position n + m). Likewise the expression 'lift (1 + (n + m)) (lift n t)' transforms the environment pointers of 't' by adding a hole a position 'n', and then adding another hole at position '1 + (n + m)'. Here is a picture of the full situation. I've just drawn the contiguous set of environment pointers as rectangles, with the holes are marked in grey.


In the first line we have set of environment pointers for 't'. Applying 'lift n' adds a hole a position 'n'. Alternatively, applying 'lift (n + m)' adds a hole at position '(n + m)'. Given 'lift n t', if we add another hole at (1 + n + m) we get the last set shown. Likewise, if we take the set 'lift (n + m) t' and add a hole at position 'n' then we also get the last line.

That's it. That complicated looking lemma above can be drawn with a few rectangles, and provided the definition of 'lift' does what it's supposed to, the lemma is obviously true. For a language like SystemF2 you need about 5 lemmas concerning lifting and substitution. They're all easy to draw, and the proofs are straightforward once you understand the general approach. In the next post I'll show how to draw substitution, which works in a similar way.

Wednesday, May 25, 2011

Proofs and mutual recursion

Quick update. I'm still working on mechanising the proofs for the DDC core language in Coq. So far I've made it through Progress, Preservation and the correspondence between Big and Small step semantics for Simply Typed Lambda Calculus (STLC), System-F, System-F2, and PCF. Extending the STLC proof with polymorphism and fixpoints etc was surprisingly straightforward. It only took about 4 hours each to create the other versions of the proof.

The next qualitative hurdle appears to be handling the mutual recursion that arises when we extend the language in other ways. To add case expressions, I need to represent case alternatives. Case alternatives contain expressions, and expressions contain (lists of) alternatives.

Unfortunately, with mutually recursive definitions, it appears that Coq can't automatically generate all the induction schemes I need. I could be wrong on this though, as I'm still a Coq newbie. I've figured out that the Coq command "Combined Schemes", will make some of the schemes I might want, but it doesn't cover all of the use-cases. The problem (I think) is that although expressions are defined in terms of lists of alternatives, the lists themselves are polymorphic and aren't directly defined in terms of alternatives. For this reason, I can't convince "Combined Scheme" to make me a strong enough induction scheme, so I must write down and prove this myself.

At least, that's what I'm inferring from reading the proof of soundness for the STG machine described in a paper by Maciej Pirog and Dariusz Biernacki. I'm still working through it.

Tuesday, April 19, 2011

Falling down the naming well

Over the last month or so I've started to formalise the proofs of the DDC core language in Coq. The existing proofs were just getting too big to manage by hand, and typing them all up in Latex was seeming like a less and less fun thing to do.

Being a Coq-newbie I started with Benjamin Pierce and co's Software Foundations course notes. I had done a couple of pages of Coq tutorial before, but never tried to use it for anything real. I've found the course notes to provide a fairly smooth ride, though I got antsy during the IMP formalisation and skipped straight to the lambda calculus one. After about two weeks of evenings I was able to redo the simply typed lambda calculus (STLC) one by myself, but hit a wall trying to scale up to System-F. To cut a longer story short, I've come to appreciate (along with many others) what a massive hack capture avoiding substitution is.

I'm fast moving towards the viewpoint that the deBruijn and/or locally nameless representation is actually the "real" lambda calculus, whereas named binders are just a layer of syntactic sugar that confuses everything (and gives you cavities). I've ended up just settling on vanilla deBruijn indices for further work, mainly because the examples I've seen so far using the locally nameless representation rely on a heavy amount of Coq Ltac magic that I don't understand yet. I'm also more than willing to trade time for cleverness. By that I mean I'm willing to commit more time for a less clever solution, provided I know that the approach I take will get the job done and I won't have to start all over again using a different formalisation of binders.

Underneath all of this is an itch in my mind saying that I was already writing a perfectly good paper about the DDC core language, and I don't want to start another one right now about "How to formalise DDC Core in Coq". Anyway, I've made it through Progress and Preservation for STLC using the deBruijn representation, including all the list lemmas, and am feeling fairly good about it so far. The proofs are in the main DDC tree at http://code.ouroborus.net/ddc/ddc-head/proof/.

Tuesday, March 29, 2011

Real Time Edge Detection in Haskell


Last week we submitted a paper to ICFP about how to implement efficient parallel stencil convolutions in Haskell. This goes along with a new release of Repa that runs on GHC 7.0.3. We've extended the old array representation to support the partitioning of arrays into several regions, and rewritten the back-end to recover sharing between the computation of adjacent array elements. Coupled with a careful management of aliasing issues, this lets us implement image processing algorithms that have comparable performance to OpenCV.

Absolute performance depends on what data format we're using. Using Word8s is about 4 times slower than OpenCV because it uses SIMD intrinsics that we don't have access to from Haskell (yet!). On the other hand, using Float32s and 2 Haskell threads can run faster than single threaded OpenCV, as the parallelism makes up for the lack of SIMD instructions.

Here is a video of me demonstrating a Canny edge detector applied to a stream captured from my iSight camera. The demo uses Objective C for the GUI, and Haskell for the processing. The source builds with XCode 3.2.1 and GHC 7.0.3, though you'll need to update and run the CONFIGURE.sh script in the root dir to point it to your GHC install. There are also prebuilt versions with the GHC "+RTS -N" option baked in for two, four and six threads. They all do the same thing apart from the number of threads. You can change the RTS options in the main.m module if you want to rebuild it. On my 4-Core i7 desktop it does about 20 frames per second with no problem -- the framerate from the iSight is set dynamically by OSX depending on system load. I've tested it on 3 separate machines, but let me know if it doesn't work for you. Non-Mac users can run the edge detector over .bmp files, using the repa-canny program from repa-examples.