Saturday, November 27, 2010

Shootout: Fannkuch redux

A few times after finishing some GHC hacking, I've mentioned to Simon Marlow that whatever I was working on now "works".
Usually his reply is "for what value of works?". Especially with compilers, there is quite a gap between "sorta works", "seems to work" and "industrial strength".

Last night I was taking a stroll through the "sorta" end of town while hacking on a Disciple version of the fannkuch-redux program from the languages benchmarks game. Disciple is a dialect of Haskell, so I started with one of the nicer existing Haskell versions.

Here is the version by Miha Vučkovič. I've reformatted it a bit and used the Disciple string pasting operator % instead of the (hated) Haskell ++, but it's otherwise the same program. (I've also added a type sig because DDC doesn't do dictionary passing yet, and used iterateL, but more on that later.)

flop (2:x1:t) = x1:2:t
flop (3:x1:x2:t) = x2:x1:3:t
flop (4:x1:x2:x3:t) = x3:x2:x1:4:t
flop (5:x1:x2:x3:x4:t) = x4:x3:x2:x1:5:t
flop (6:x1:x2:x3:x4:x5:t) = x5:x4:x3:x2:x1:6:t
flop (7:x1:x2:x3:x4:x5:x6:t) = x6:x5:x4:x3:x2:x1:7:t

flop lst@(h:_) = r
 where  flop' 0 (t, r)     = (t, r)
        flop' n ((h:t), r) = flop' (n-1) (t, h:r)
        (t, r)             = flop' h (lst, t)

flopS (1:_) = 0
flopS lst = 1 + flopS (flop lst)

rotate n (h:t) = rotate' (n-1) t
 where  rotate' 0 l       = h:l
        rotate' n (f:t)   = f:(rotate' (n-1) t)

checksum :: Int -> Int -> Int
checksum i f
 | mod i 2 == 0  = f
 | True          = -f

pfold :: (Int, Int) -> [(Int, Int)] -> (Int, Int)
pfold r [] = r
pfold (ac, af) ((c, f):t) = pfold (sc, sf) t
 where  sc = ac+c
        sf = max af f

permut n = foldr perm [[1..n]] [2..n]
 where  perm x lst = concat [take x $ iterateL (rotate x) l | l <- lst]

main ()
 = do   n = 8
        (chksm, mflops)
                = pfold (0,0)
                $ map (\(i, p) -> let flops = flopS p
                                  in  (checksum i flops, flops))
                $ zip [0..] (permut n)

        putStr $ show chksm % "\nPfannkuchen(" % show n % ") = " % show mflops % "\n"

It doesn't matter what it does, or how it does it. Check the shootout page if you're interested; we're just trying to get it running for now.

Faking value recursion

My first compile attempt reported:

desire:ddc-head-devel benl$ bin/ddc -O --make test/60-Shootout/Fannkuch-Redex/MainShoot.hs
[1 of 1] Compiling MainShoot
ddc: ERROR
Conflicting region constraints.

constraint: Direct %115
from the use of: flop
at: ./test/60-Shootout/Fannkuch-Redex/MainShoot.hs:16:29

conflicts with,
constraint: Lazy %115
from the use of: permut
at: ./test/60-Shootout/Fannkuch-Redex/MainShoot.hs:42:29

Confusing, yes, but at least it's a civilised message. What it's trying to say is that it thought the region named %115 was Direct, meaning it cannot contain thunks, but later code wants to add thunks to it (make it Lazy). Disciple tracks which objects can and cannot be thunks, because if the backend doesn't have to check for them then the code can be a lot faster.

The confusing thing is why it thinks this region is supposed to be Direct. Top-level data (like global integers) defaults to Direct because when we create it we know it's fully evaluated, not a thunk. However, flop is a function, so it should be polymorphic in this respect: it should not require its arguments to be Direct.

Ok, looking in the interface file, the inferred type of flop is:

flop    :: Data.List.List %rTC0 (Int32 %rTC4) -(!e0 $c0)> Data.List.List %rTC0 (Int32 %rTC4)
        :- $c0  = ${t : %rTC0} + ${t : %rTC4}
        ,  !e0  = !Read %rTC0 + !Read %rTC4
        ,  Const %rTC0
        ,  Const %rTC4
        ,  Direct %rTC0
        ,  Direct %rTC4

What I'm looking at is the line:

        $c0 = ${t : %rTC0} + ${t : %rTC4}

This is a closure constraint, and it reveals that the inferencer thinks this function is referencing data in regions %rTC0 and %rTC4 from its environment. That might be true if it were referring to some top-level piece of data, but it's not. The closure constraint also reports the variable being used to refer to this data, namely "t".

A closer inspection of flop reveals the underlying problem:

flop lst@(h:_) = r
 where  flop' 0 (t, r)     = (t, r)
        flop' n ((h:t), r) = flop' (n-1) (t, h:r)
        (t, r)             = flop' h (lst, t)

Here, Miha has been tricky and used value recursion in the last line. Notice that "t" appears on both the left and the right of the binding, but the binding isn't defining a recursive function -- it's a recursive value.

Disciple doesn't support value recursion yet, as it's tricky to implement in a default-strict language, but I'd like to look into it in the future. In Disciple today only function bindings can be recursive, so the renamer should have reported that "t" is not in scope on the right of the binding. Instead, that screwup has gotten through the renamer, and now the inferencer thinks that "t" has been defined outside of "flop". It has then defaulted its region to Direct and Const, as there are no opposing Lazy or Mutable constraints.
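For comparison, Haskell itself is lazy by default, so a recursive value binding works directly there. A tiny sketch (my own example, not code from the benchmark):

```haskell
-- knot appears on both sides of its own binding: a recursive *value*,
-- not a recursive function. Laziness ties the knot, so each cons cell
-- is only built when it is demanded.
knot :: [Int]
knot = 0 : map (+ 1) knot

main :: IO ()
main = print (take 5 knot)   -- [0,1,2,3,4]
```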

The work around is to rewrite this as an actual function binding, then use laziness to break the recursive loop:

flop lst@(h:_) = snd (thing ())
 where  flop' 0 (t, r)     = (t, r)
        flop' n ((h:t), r) = flop' (n-1) (t, h:r)
        thing ()           = flop' h @ (lst, fst @ (thing @ ()))

A @ on the right of a binding means "lazy function application". For example, the expression (f @ x) builds a thunk holding pointers to the code for "f" and its argument "x". The programmer decides where to suspend expressions, and they are forced automatically by consumers as needed. I should probably use a different operator because it looks weird with respect to the @ on the left of bindings, but no matter.
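If you're used to Haskell, where every application is implicitly suspended, you can think of (f @ x) as building this kind of value explicitly. A rough Haskell model (the Susp, susp and force names are mine, not anything from the Disciple libraries):

```haskell
-- A suspension holds the code for f together with its argument x,
-- without running f yet -- roughly what (f @ x) builds.
newtype Susp a = Susp (() -> a)

-- "Lazy apply": build a thunk instead of calling f now.
susp :: (b -> a) -> b -> Susp a
susp f x = Susp (\() -> f x)

-- Run the thunk when a consumer demands the value.
force :: Susp a -> a
force (Susp k) = k ()

main :: IO ()
main = print (force (susp (+ 1) 41))   -- 42
```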

Changing this makes the program run fine, at least with an input size of 7 (the "n" in the main function).

desire:ddc-head-devel benl$ bin/ddc -O --make test/60-Shootout/Fannkuch-Redex/MainShoot.hs
[1 of 1] Compiling MainShoot

desire:ddc-head-devel benl$ ./a.out
Pfannkuchen(7) = 16

Sadly, setting n >= 8 gives

desire:ddc-head-devel benl$ ./a.out
*** DDC RTS PANIC! Slot stack overflow.
Abort trap

Explicit Laziness

The "slot stack" is the stack that holds pointers to heap objects. It is sometimes called a "shadow stack", and the pointers on it are used as roots for the garbage collector. The reason it has overflowed is that Disciple is a default-strict language, so our friends map, fold, filter etc evaluate the entire list when called. We can see where the stack overflowed by running the program in GDB, the GNU debugger.

desire:ddc-head-devel benl$ gdb a.out
GNU gdb 6.3.50-20050815 (Apple version gdb-1346) (Fri Sep 18 20:40:51 UTC 2009)
(gdb) run
Starting program: /Users/benl/devel/ddc/ddc-head-devel/a.out
Reading symbols for shared libraries ++. done
*** DDC RTS PANIC! Slot stack overflow.

Program received signal SIGABRT, Aborted.
0x93a74732 in __kill ()
(gdb) bt
#0 0x93a74732 in __kill ()
#1 0x93a74724 in kill$UNIX2003 ()
#2 0x93b0798d in raise ()
#3 0x93b1da44 in abort ()
#4 0x00012041 in _panicOutOfSlots ()
#5 0x00003a0c in Data_List__symCl ()
#6 0x00003c04 in Data_List_iterateL ()
#7 0x00014d17 in _forceStep ()
#8 0x00014db8 in _force ()
#9 0x000048a0 in Data_List_take ()
#10 0x000049c5 in Data_List_take ()
#11 0x0000363c in Data_Function__symDl ()
#12 0x0000679f in _permut_vCL2 ()
#13 0x0000434a in Data_List_concatMap ()
#14 0x00004362 in Data_List_concatMap ()
#15 0x00004362 in Data_List_concatMap ()
#16 0x00004362 in Data_List_concatMap ()
#17 0x00004362 in Data_List_concatMap ()
#18 0x00004362 in Data_List_concatMap ()
#19 0x00004362 in Data_List_concatMap ()
#20 0x00004362 in Data_List_concatMap ()
#21 0x00004362 in Data_List_concatMap ()
#22 0x00004362 in Data_List_concatMap ()
#23 0x00004362 in Data_List_concatMap ()
... 5000 more lines

Usefully, DDC compiles via C and also uses the regular C stack for return addresses and unboxed data. The intermediate C files look like they almost could have been written by a human, and there is no evil mangler butchering the assembly code once it's been compiled. This means we get real stack traces, and I don't have the same sense of impending doom as when a GHC compiled program segfaults on me.

From the stack trace, it looks a bit like it's run out of space in concatMap. There is no concatMap in the source code, but of course every Haskell hacker knows that list comprehensions are desugared to concatMap, so it's run out of space in the list comprehension.
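As a refresher, a comprehension and its concatMap desugaring compute the same list; a small Haskell check (my own example):

```haskell
-- [e | x <- xs, p] desugars to concatMap (\x -> if p then [e] else []) xs.
viaComp :: [Int]
viaComp = [x * 2 | x <- [1, 2, 3], x /= 2]

viaConcatMap :: [Int]
viaConcatMap = concatMap (\x -> if x /= 2 then [x * 2] else []) [1, 2, 3]

main :: IO ()
main = print (viaComp == viaConcatMap)   -- True; both are [2,6]
```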

Disciple is default strict, so if you want producer/consumer style list processing to run in constant space using laziness, then you have to say so. Here are new versions of permut and main after rewriting the list comprehension to use a map, and converting to the lazy list functions concatL, mapL and zipL:

permut n = foldr perm [[1..n]] [2..n]
 where  perm x lst = concatL $ mapL (\l -> take x $ iterateL (rotate x) l) lst

main ()
 = do   n = 10
        (chksm, mflops)
                = pfold (0,0)
                $ mapL (\(i, p) -> let flops = flopS p
                                   in  (checksum i flops, flops))
                $ zipL [0..] (permut n)

        putStr $ show chksm % "\nPfannkuchen(" % show n % ") = " % show mflops % "\n"

The functions concatL, mapL and zipL have similar definitions to the strict versions, but they're lazier. This is like the difference between foldl and foldl' in the Haskell libraries. iterateL builds an infinite list, so there is no strict version called plain "iterate".
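The Haskell analogue shows why the lazy versions fix the space problem: iterate conceptually builds an infinite list, but the consumer only ever forces the prefix it demands. A tiny sketch:

```haskell
-- iterate (* 2) 1 is the infinite list [1,2,4,8,...], but take only
-- forces the first five cells, so this runs in constant space.
prefix :: [Int]
prefix = take 5 (iterate (* 2) 1)

main :: IO ()
main = print prefix   -- [1,2,4,8,16]
```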

This makes n=8 work, but n=9 runs out of heap.

desire:ddc-head-devel benl$ bin/ddc -O --make test/60-Shootout/Fannkuch-Redex/MainShoot.hs
[1 of 1] Compiling MainShoot
desire:ddc-head-devel benl$ ./a.out
*** DDC RTS PANIC! Out of heap space.
current (full) heap size: 9999999 bytes
could not allocate another: 16 bytes
Abort trap

In a real runtime system, when the heap is full the RTS would ask the OS for more. Unfortunately, the Disciple runtime is less featureful: it uses a fixed-size heap, and when it's full it's dead. It does have a garbage collector, but when all data in the heap is live there's nothing more the GC can do. However, you can set the size of the heap on the command line. Giving it 100Megs gets us n=9.

desire:ddc-head-devel benl$ ./a.out +RTS -H 100000000
Pfannkuchen(9) = 30

The Treasure Finder

Garbage collection is a misnomer. Garbage collectors don't collect unused data, they copy out live data to a new space. It'd be more accurate to call them "Treasure Finders". Anyway, for interest's sake we can examine the rate at which treasure is being found in our program. The DDC runtime supports some simple statistics like so:

desire:ddc-head-devel benl$ ./a.out +RTS -H 100000000 -profile-gc
Pfannkuchen(9) = 30
desire:ddc-head-devel benl$ cat

-- Garbage Collection
collection count        = 17

alloc     total         = 760,836,560 (bytes)

copy      total         = 1,034,618,572 (bytes)
          count         = 68,351,051 (objects)
          avg obj size  = 15.137 (bytes)

process time  user      = 6.120 (s)
              system    = 0.070 (s)

mutator   total         = 1.760 (s)
          user          = 1.700 (s)
          system        = 0.060 (s)

collector total         = 4.430 (s)
          user          = 4.420 (s)
          system        = 0.010 (s)

time efficiency         = 28.433 (%)

Heh, 28% time efficiency isn't the best. The Disciple GC is a simple two-space collector. It's not a generational GC and there is no nursery. The way the code is compiled also makes it hold onto far too much intermediate data. Fixing that would be equivalent to doing "register allocation" on the GC slot stack, so that it doesn't treat an object as a root unless it really needs to.

Of course, the great thing about having 8GB of RAM in your desktop machine is that you can achieve so much with a naive language implementation. Let's compare against GHC to see how we're going:

desire:ddc-head-devel benl$ /usr/bin/time ./a.out +RTS -H 1000000000
Pfannkuchen(10) = 38
83.04 real 82.12 user 0.90 sys

desire:ddc-head-devel benl$ ghc -O2 --make test/60-Shootout/Fannkuch-Redex/MainGHC.hs -o Main
desire:ddc-head-devel benl$ /usr/bin/time ./Main 10
Pfannkuchen(10) = 38
2.10 real 2.08 user 0.01 sys

That's when I went to bed.

Wednesday, October 20, 2010

Effect inference

A few people have mentioned they are worried about the type signatures from my previous post.

For a sig like:

fillPrimes :: Array Int -> Int -> Int -> Int -> ()

The fact that there is no 'IO' constructor in that type doesn't mean that side effects aren't being tracked. The type inferencer knows what's going on, and so does the optimiser, you just don't have to write effect information in sigs if you don't want to.

If we add all the information about effects and mutability we actually have:

fillPrimes
 :: forall %r0 %r1 %r2 %r3 %r4
 .  Array %r0 (Int %r1) -> Int %r2 -> Int %r3 -> Int %r4 -(!e0)> ()
 :- !e0 = !Read %r0 + !Read %r1 + !Write %r1 + !Read %r2 + !Read %r3 + !Read %r4
 ,  Mutable %r1

A term like Array %r0 (Int %r1) means that fillPrimes accepts an array of boxed integers. The pointers (handles) to those integers are in some region named %r0, and the integers themselves are in region %r1. The effect term !Write %r1 reveals that fillPrimes updates those integers, which requires them to be mutable, hence the Mutable %r1 constraint.

Writing sigs this way is a bit verbose, so we also support the following "short hand" syntax. This sig gets desugared to the previous one using some simple defaulting rules:

fillPrimes :: Array (Int{modify}) -> Int{read} -> Int{read} -> Int{read} -> ()

Full information about effects and mutability is exported in the interface files. I'm currently working on a system whereby you can say "These are the effects I think this function has, but tell me if you think it has more." I'll post more about this when it's done (it half works now).

Saturday, October 16, 2010

Code: Three primes programs

This post illustrates some of the programming styles possible in Disciple.

We'll start with ye'olde primes program straight from the nofib benchmark suite. This is the exact code, except that for the sake of example I've hard wired it to show the 1500th prime instead of taking a command line argument.

suCC :: Int -> Int
suCC x = x + 1

isdivs :: Int -> Int -> Bool
isdivs n x = mod x n /= 0

the_filter :: [Int] -> [Int]
the_filter (n:ns) = filter (isdivs n) ns

primes :: [Int]
primes = map head (iterate the_filter (iterate suCC 2))

main = print $ primes !! 1500

This program builds an infinite list of primes, and then uses list indexing to take the 1500'th one. A couple of nice functional programming paradigms are used, namely higher order functions and laziness. Disciple is a default-strict dialect of Haskell, so we can use almost the same code, except we have to be explicit when we want laziness:

suCC :: Int -> Int
suCC x = x + 1

isdivs :: Int -> Int -> Bool
isdivs n x = mod x n /= 0

the_filter :: [Int] -> [Int]
the_filter (n:ns) = filterL (isdivs n) ns

primes :: [Int]
primes = mapL head (iterateL the_filter (iterateL suCC 2))

main () = println $ show $ primes !! 1500

That's it. I've changed map, filter and iterate to mapL, filterL and iterateL, and wibbled the main function slightly, but it's otherwise the same code. The difference between something like map and mapL is like the difference between foldl and foldl' in Haskell, they're implemented almost the same way, but have slightly different behaviours. Check the base libs for details.

Ok, so Disciple can express the dominant paradigms of Haskell, now it's time to subvert them. In the "Beyond Haskell" session at HIW 2010 we talked about how sometimes when optimising a program in a high level language, you reach a point where you just have to drop down to a "lower level" language for performance reasons. This is like when you're writing a Ruby program and it's running so slow there's simply nothing else to do than to "shell out" to a C routine. This also happens in Haskell, but perhaps not as much.

Anyway, one of the goals of Disciple is to avoid this problem if at all possible. If the lazy, purely functional version of primes just isn't fast enough for you, then let's write the same program using destructive update of a mutable array:

-- | Check if an int is a multiple of any in a given array.
checkPrime :: Array Int -> Int -> Int -> Int -> Bool
checkPrime array high x n
 | n >= high             = True
 | mod x array.(n) == 0  = False
 | otherwise             = checkPrime array high x (n + 1)

-- | Fill an array with primes.
fillPrimes :: Array Int -> Int -> Int -> Int -> ()
fillPrimes primes max high i
 | high > max = ()

 | checkPrime primes high i 0
 = do   primes.(high) := i
        fillPrimes primes max (high + 1) (i + 1)

 | otherwise
 = fillPrimes primes max high (i + 1)

main ()
 = do   -- We want the 1500'th prime.
        max     = 1500

        -- Start with an array containing the first prime as its first element.
        primes  = generate&{Array Int} (max + 1) (\_ -> 0)
        primes.(0) := 2

        -- Fill the array with more primes.
        fillPrimes primes max 1 2

        -- Take the last prime found as the answer.
        println $ show primes.(max)

The syntax primes.(max) means "return the max'th element of the primes array". The syntax primes.(high) := i means "destructively update the high'th element of the primes array to the value i". Updating an array requires the array to be mutable. It's possible to express this mutability constraint in the type signature, but it's not required, so I usually just leave it to the inferencer. The compiler knows that you're using side effects, and will optimise around them, but you usually don't have to say anything about it in source level type sigs. I'll talk about the cases when you do in another post. Note that we don't have to pollute the type sigs with the IO constructor just to use destructive update. After all, checkPrime and fillPrimes aren't doing any IO...
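For contrast, here's roughly what this version looks like in standard Haskell, where the array lives in IO and that constructor does spread through every signature. This is only my sketch of the same algorithm (using IOUArray from the array package), not code from the benchmark:

```haskell
import Data.Array.IO

-- Check if x is a multiple of any prime stored in slots [0, high).
checkPrime :: IOUArray Int Int -> Int -> Int -> IO Bool
checkPrime arr high x = go 0
  where go n | n >= high = return True
             | otherwise = do
                 p <- readArray arr n
                 if mod x p == 0 then return False else go (n + 1)

-- Fill slots [high .. maxN] with successive primes.
fillPrimes :: IOUArray Int Int -> Int -> Int -> Int -> IO ()
fillPrimes arr maxN high i
  | high > maxN = return ()
  | otherwise = do
      ok <- checkPrime arr high i
      if ok
        then writeArray arr high i >> fillPrimes arr maxN (high + 1) (i + 1)
        else fillPrimes arr maxN high (i + 1)

-- Seed the array with 2, fill it, and read out the last prime found.
lastPrime :: Int -> IO Int
lastPrime maxN = do
  primes <- newArray (0, maxN) 0 :: IO (IOUArray Int Int)
  writeArray primes 0 2
  fillPrimes primes maxN 1 2
  readArray primes maxN

main :: IO ()
main = print =<< lastPrime 100
```

Running lastPrime 1500 should give the same answer as the Disciple version; I've used a smaller bound here to keep the example quick.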

The above code is good, but it still feels like a Haskell program. I'm particularly looking at the tail calls to checkPrime and fillPrimes. This is how you express primitive looping in many functional languages, but it can become tedious when there are lots of state variables to pass back to each iteration. Here is another version of primes written in a more imperative style.

-- | Check if an int is a multiple of any in a given array.
checkPrime :: Array Int -> Int -> Int -> Bool
checkPrime array high x
 = do   n       = 0
        isPrime = 1

        while (n < high)
         do     when (mod x array.(n) == 0)
                 do     isPrime := 0
                        break

                n := n + 1

        isPrime /= 0

main ()
 = do   -- We want the 1500'th prime.
        max     = 1500

        -- Start with an array containing the first prime as its first element.
        primes  = generate&{Array Int} (max + 1) (\_ -> 0)
        primes.(0) := 2

        -- Fill the array with primes.
        high    = 1
        i       = 2
        while (high <= max)
         do     when (checkPrime primes high i)
                 do     primes.(high) := i
                        high := high + 1

                i := i + 1

        -- Take the last prime found as the answer.
        println $ show primes.(max)

Now we've got while loops, and break, even. Of course, the while syntax desugars to a simple higher order function that takes the loop body, and break just throws an exception to end the iteration. Such syntax should help the Python programmers feel at home, but Disciple is still lambdas all the way down.
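That desugaring is easy to sketch. In Haskell, a while of this flavour is nothing more than a recursive higher-order function over a condition and a body (while and sumTo are my names; this sketch ignores break, which needs the exception trick mentioned above):

```haskell
import Data.IORef
import Control.Monad (when)

-- A while "loop" is a higher-order function that re-runs its body
-- as long as the condition computes True.
while :: IO Bool -> IO () -> IO ()
while cond body = do
  continue <- cond
  when continue (body >> while cond body)

-- Sum 1..n imperatively, walking mutable counters the way the
-- primes loop above walks high and i.
sumTo :: Int -> IO Int
sumTo n = do
  i   <- newIORef 1
  acc <- newIORef 0
  while ((<= n) <$> readIORef i) $ do
    v <- readIORef i
    modifyIORef' acc (+ v)
    modifyIORef' i (+ 1)
  readIORef acc

main :: IO ()
main = print =<< sumTo 10   -- 55
```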

Finally, note that we started with a lazy, purely functional program but arrived at an unapologetically imperative program all in the same language. In my dreams people talk about "functional vs imperative" programs instead of "functional vs imperative" languages.

Write the lazy functional program because it's shorter, cleaner, and easier to understand... but when the clock is tickin' and the space is leakin', wouldn't it be nicer to "shell out" to the same language?

PS: Code examples are part of the DDC test suite here

PPS: Of course the absolute performance of these programs compiled with DDC today is atrocious. This is a development blog after all, and we're still developing. They do work though. See the wiki to find out how you can help.

Hacks: Type error beautification

In this series of blog posts I'll name the ones that describe what I'm currently working on after "Hacks: whatever".

I'm currently working on type error beautification. The problem is that without it you get error messages that look like this:

Inferred type:
        :: forall %rDE111 %rDE112 %rDE113
        .  Int32 %rDE111 -> Int32 %rDE112 -(!eDE484 $cDE487)> Int32 %rDE113
        :- !eDE484 = !{!Read %rDE111; !Read %rDE112}
        ,  $cDE487 = Test.x : %rDE111

Is bigger than signature:
        :: forall %rDE111 %rDE112 %rDE113
        .  Int32 %rDE111 -> Int32 %rDE112 -(!eDE484 $cDE487)> Int32 %rDE113
        :- !eDE484 = !Read %rDE111
        ,  $cDE487 = xDE485 : %rDE111

Because !{!Read %rDE111; !Read %rDE112} does not match !Read %rDE111

That's all fine, apart from the variable names like %rDE111 and $cDE487. Those are "dummy" variables that have been invented by the type elaborator, which fills in missing effect terms in type sigs provided by the user. For example, the source program that gave the above error message is actually:

add <: Int{read} -> Int -> Int
add x y = x + y

The type signature says that the function is permitted to read its first argument, where {read} is a short-hand way of giving the real region and effect term. The elaborator takes the short-hand sigs provided by the user, then fills in the missing bits using some simple defaulting rules, before passing them to the type checker proper. Of course, this function actually reads both of its arguments, hence the error message.

The problem is that when the elaborator needs a new variable it just takes the next available name in the namespace, so the exact names you get depend on what other signatures the elaborator has already processed. This means that if you add a new signature to an imported module, the dummy variables in all the successive signatures change. That's fine for type checking, but it means that error messages presented to the user aren't consistent. It's become a problem because our regression tests compare error messages to what they were before, so now if you modify any of the base libraries, you have to accept a bunch of wibbles in the test suite.

The solution is to rewrite variables in each set of error messages to use consistent names like t0, t1, %r1, %r2 etc before presenting them to the user. Complicating factors are that if a variable was a "real" variable that came from the source program we don't want to rewrite it, and that generated/elaborated names should not collide with real names from the source program.
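The heart of that rewrite is simple enough to sketch: walk the variables in order of first appearance, handing out fresh names from a counter, and pass source-level names through untouched. A toy Haskell version (beautify and the isDummy predicate are my own names, and this ignores both the collision problem and the per-namespace counters):

```haskell
import qualified Data.Map as Map

-- Rename generated variables to t0, t1, ... in order of first
-- appearance; variables that came from the source program pass
-- through unchanged, so repeated dummies get the same new name.
beautify :: (String -> Bool) -> [String] -> [String]
beautify isDummy = go Map.empty (0 :: Int)
  where
    go _   _ []     = []
    go env n (v:vs)
      | not (isDummy v)             = v  : go env n vs
      | Just v' <- Map.lookup v env = v' : go env n vs
      | otherwise                   =
          let v' = 't' : show n
          in  v' : go (Map.insert v v' env) (n + 1) vs

main :: IO ()
main = print (beautify (elem 'D') ["%rDE111", "x", "%rDE112", "%rDE111"])
-- ["t0","x","t1","t0"]
```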

This is one of those PITAs you discover when you sit down and try to write a real compiler. No sexy semantics, no fancy extensions to the type system. I need to format error messages properly so my regression tests don't break after every commit... but we press onward for the glory.

For Human Consumption

Over the last year I've spent a lot of time cleaning up the internals of DDC, and I've reached a point where I'm no longer embarrassed about the implementation. It's still full of bugs and holes, but enough of the structure and design is there that others should be able to find their way around the code base without too much trouble. I could also use some help, so this blog is an attempt to publicise the project and hopefully convince some of you wonderful people that Disciple/DDC is indeed the way of the future, and that spending some time hacking on it would be a fun thing to do...

I've been working on DDC since I started my PhD back in 2004, though there were several false starts and major rewrites along the way. Back then it was called "Trauma", because all the good names like "Clean", "Hope" and "Eden" were already taken, and I wanted a name that reflected what language design / compiler development / doing a PhD was really like.

DDC as a "plan" started to congeal in 2006 when I had done enough reading and research to form a vision of where I wanted to go. The first patch in the repo is from Sunday September 9th 2007. Before then my "version control" consisted of making a tarball every couple of weeks. Prior to starting my PhD I worked as a C++ programmer for the Australian government, and my experience of ClearCase there was enough to put me off version control systems entirely until I met darcs while working on GHC at MSR in 2007.

Anyway, from "Trauma" we now have "Disciplined Disciple". A few people have asked me about the awkwardly religious sounding name, so I'll explain it here for the record. I chose this name to remind myself to spend more time reading papers than dreaming up crazy language features. Writing a compiler is a massive amount of work, and the only way I see myself ending up with a usable system is by sticking with the game plan and not getting side tracked. Disciple is supposed to be Haskell - Laziness + Effect typing + Mutability Polymorphism. The way I see it there are a huge number of ideas and libraries from the Haskell community that would really shine if we just shifted the base language a bit, and if you've seen the "Beyond Haskell" talk from HIW 2010 in Baltimore you'll know what I mean by that. I want to see the next version of Quake written ground-up in a functional language, with no cop-outs about how it's good enough just to influence the design of C#2++. I had a gut-full of pointer based, monomorphic, imperative programming in my previous life and I'm not planning on going back.

I won't live long enough to invent a whole new ecosystem or approach to writing programs myself, but if I stick to the game plan and follow the leaders wherever possible I hope that will be enough to push it through. I am the Disciplined Disciple.