Monday, September 14, 2009

Get by with a little help from my friends...

I've had two random people express an interest in the 9P in Haskell stuff I've been thinking about, for similar reasons. Tim Newsham went as far as to write a bunch of code for encoding/decoding of 9P messages to ByteStrings and mailed it to me.

I'm not 100% sure it's 100% correct, but it looks darned good, and far cleaner than what I'd written, so I've used it in place of mine at the 9ph project page.

I've attempted to port my version test code (sends a 9p version message to a styx server on Inferno) to see if that works, but I'm not sure that the string type is being handled properly for the TVersion message, where it must be prefixed by a little endian encoded 16 bit integer for the protocol version being negotiated ("9P2000" in our case).

Still 9P in Haskell looks like it will be useful for several folks, and I'm glad for that, even if I don't get mine done.

Thursday, September 10, 2009

Prime Sieves again

Well I've been playing with libdispatch quite a bit lately now. Apple just open sourced it here, and I hope linux/BSD folks figure out how to adopt it, because it's really cool!

I took my pipe based CSP-like prime sieve, and at the prodding of my friend Kevin at Apple, rewrote it so that the blocks themselves contain the "message" and the queues contain the context I'm interested in. This is in contrast to the pipe fd's being an event source to fire a block, and the fd itself having the context once it was an event source object.

This meant re-structuring the code a bit, and capturing the value I would have written to the pipe in the block's closure. To do so I opted to make a function as such


// Runs a block asynchronously on a queue
void run_block (dispatch_queue_t q, int cur_num) {
dispatch_async(q,
^{
struct context * ctx = (struct context *) dispatch_get_context(q);

if (cur_num == LAST) {
if (ctx->has_spawned == 1) {
run_block(ctx->neighbor, cur_num);
}
dispatch_release(q);
}
else if (cur_num == ctx->mynum) {
printf("%d\n", cur_num);
}
else if (cur_num % ctx->mynum != 0) {
if (ctx->has_spawned == 0) {
ctx->neighbor = create_queue(cur_num);
ctx->has_spawned = 1;
}
run_block(ctx->neighbor, cur_num);
}

});
}


(Full source here)

I had someone tell me #define'ng MAX was bad, and I wanted it to be a command line argument anyway, so I made it so. (here)

Now this version runs pretty nicely, but all that block copying overhead is causing a lot of malloc'ng memory for const copies of contexts. Seems a waste. My friend Kevin pointed this out to me and said "try the async_f APIs".

These take a function pointer rather than a block. It's easy enough to convert this block to the function pointer form:


void worker (void * fctx) {
int cur_num = (int) fctx; // bad bad bad, but I don't care.
dispatch_queue_t cur_q = dispatch_get_current_queue();
struct context * qctx = (struct context *) dispatch_get_context(cur_q);

if (cur_num == LAST) {
if (qctx->has_spawned == 1) {
dispatch_async_f(qctx->neighbor, (void *)cur_num, worker);
}
dispatch_release(cur_q);
}
else if (cur_num == qctx->mynum) {
printf("%d\n", cur_num);
}
else if (cur_num % qctx->mynum != 0) {
if (qctx->has_spawned == 0) {
qctx->neighbor = create_queue(cur_num);
qctx->has_spawned = 1;
}
dispatch_async_f(qctx->neighbor, (void *)cur_num, worker);
}
}


Now when I run my for loop it looks like this:


for (x = 2; x <= max; ++x) {
dispatch_async_f(two_q, (void *)x, worker);
}


What is very interesting to look at is the pipe version's performance, vs the block copy, vs the function pointer version.

Sieving all the primes from 2 to 9000 with the file descriptor version:
1.68s user 5.05s system 171% cpu 3.929 total

With block copy:
0.24s user 0.02s system 118% cpu 0.219 total

With function pointers:
0.06s user 0.02s system 78% cpu 0.098 total

It's hard to do a lot of sieving with the file descriptor version, because you have up your ulimit, and it hits the kernel a lot (see the system time of 5.05s, vs 1.68 user time!). To demonstrate how much faster the function pointer version is compared to the block version, we need a larger range.

Here's results for 2-90000

With block copy:
14.52s user 0.21s system 161% cpu 9.119 total

With function pointers:
2.66s user 0.15s system 130% cpu 2.156 total

Look at that speedup! Note how very little time is spent in the kernel making system calls.

Not too shabby.

One thing I did find out is that dispatch_async on a queue that has not set it's own target_queue to the main queue, will create a TON of threads. I had up to 179 at one point. Once I added the following line to the queue creation function I wrote, it stably stays at 4 threads:


dispatch_set_target_queue(rv, dispatch_get_global_queue(0,0));


This was again advice given to me by my friend Kevin at Apple.

Thanks Kevin, this has been enlightening!

Tuesday, September 8, 2009

Last bug found

Kevin spotted this before I did, but I'm afraid it's not quite what he thought. I was using calloc to allocate a context for an event source, but part of that context was initialized after the block's context was created. This lazy initialization of the context caused me trouble in that I was writing a -1 to stdin (0, because I used calloc should have been stdin, but it was making ???? appear at the console).

At any rate the completed, and corrected code is here.

Kevin went on to inform me that I don't need the semaphore and task counter because I can just use groups and group synchronization on the dispatch_objects that are sources. He sent me some code, and I'll probably take a look at it, but for the purposes of demonstration I think this code is basically final. (Groups do make sense to use here however as there's less manual book keeping, and that's part of the advantage of libdispatch and GCD).

Monday, September 7, 2009

A simple solution?

Well that crash I caused is partially due to my own terrible error handling, and the fact that I didn't boost my ulimit for file descriptors (I am using a good bit of pipes in my process, but my per-user limit is ridiculously low!!!).

Upped the ulimit to 2048 from a pathetically low 256 and now I can get through a lot more prime numbers before barfing. (It's a laptop man! I'm the ONLY user...)

A rather big part of this is that I'm using pipes for communication between event sources, which fire off my blocks. There's exactly one pair of fd's for every prime I find, plus all the fds in use under my uname.

More libdispatch fun

Been playing with libdispatch of GCD from Snow Leopard for a few days now, and I'm having a heck of a time with a demo program I've been writing that's just not working for me yet.

In fact, today while trying to make the synchronization happier (and succeeded), I realized I'm getting some junk out on stdout. After trying to figure out exactly why this is happening by tweaking the code and rebuilding it, I managed to panic the kernel.

The code is still at lisppaste.

Friday, September 4, 2009

Adventures with libdispatch

Apple released Snow Leopard to the world last Thursday, and with it came this interesting piece of technology called Grand Central Dispatch. It totally changes the way C, C++, and Objective-C programmers will approach concurrency and threading if they were used to anything at all like pthreads.

Having built up a bit of a functional programming background for myself, I'm no stranger to new ways to do concurrent programming. Scheme has its ways. Haskell has a couple of ways (explicitly forkIO'ng sparks that might be run on threads, or data parallelism). Also, I've tinkered with such things as typed data channels in Limbo from the folks at Bell Labs (seemingly inspired by Newsqueak by Rob Pike and CSP itself).

When I first looked at Grand Central Dispatch from Apple, I was a bit puzzled by the API. There's queues, event sources, asynchronous dispatching, and this interesting feature they've added to their implementation C language (though some of my fellow 9fans have been somewhat outspoken against it) called blocks. Blocks are closures for C. If you want to know more about closures consult a good book on Scheme or a Scheme tutorial. (or Lisp!)

I'm more used to a CSP-like programming interface, and I wondered if I could port a rather famous demo in Newsqueak, and Limbo to this interface. It's a Prime Number Sieve (Pike?).

The algorithm may be explained as such (poorly?):

  1. Create a channel of integers
  2. Start a process that puts onto the channel from step 1 all the numbers from [2,N]
  3. Start a process that reads from the channel all incoming numbers. The first number read is "special" and becomes that process's special number. This process prints it's number and creates another channel of ints, that it will feed all numbers that fail the test (incoming_number % my_special_num == 0). In this way it filters out all numbers that are multiples of it's special number.
  4. Start another process and feed it the first number that passes through this filter.... the new process follows the same step 3 as above.
This creates us a sieve that only lets prime numbers escape.

A good animation of how one works is here.

My prose is rather lacking, so without further ado, here is the code I came up with to try to approach a prime number sieve, using pipes, and blocks, and libdispatch routines to concurrently schedule the sieve on my mac.
// David Leimbach
// September 4, 2009
//
// First, unclean attempt at a prime number sieve in C with libdispatch
//
// Special thanks to Kevin Van Vechten and Jordan Hubbard for guidance through this new API!
//
// Thanks to Apple for making Snow Leopard so affordable!

#include <dispatch/dispatch.h>

#include <stdio.h>

#include <stdlib.h>

#include <sys/types.h>

#include <sys/uio.h>

#include %lt;unistd.h>


#define MAX 300

struct context
{
int mynum;
int has_spawned;
int writefd;
};

void setup_block(int read_fd) {
dispatch_source_t piperead;
struct context * ctx;
piperead = dispatch_source_create(DISPATCH_SOURCE_TYPE_READ, read_fd,
0, dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_LOW, 0));

ctx = calloc(sizeof(struct context), 1);

ctx->has_spawned = 0;

dispatch_source_set_event_handler(piperead,
^{
int val;
int fd[2];
struct context * ctx = dispatch_get_context(piperead);

read(read_fd, &val, 4);

if (ctx->has_spawned == 0) {
pipe(fd);
ctx->writefd = fd[1];
ctx->mynum = val;
ctx->has_spawned = 1;
setup_block(fd[0]);
printf("%d\n", val);
}
else {
if (val % ctx->mynum != 0) {
write(ctx->writefd, &val, 4);
}
}
});

dispatch_set_context(piperead, ctx); // associate this state with this context

dispatch_resume(piperead);
}


void run_sieve(int writefd) {
int x;

for(x=2; x <= MAX; ++x)
   write(writefd, &x, 4);
   sleep(2); 
}  

int main () {
    int fd[2];
    pipe(fd);
    setup_block(fd[0]);    
    run_sieve(fd[1]);    
    return 0; 
}
 (Please forgive the code formatting, I'm having great difficulty with blogspot on this stuff with their editor)

Tuesday, September 1, 2009

Plan 9 machine running

Erik Quanstom's 9atom.iso build of 9load correctly finds my motherboard specifics and can install a Plan 9 system on my old machine making it such that I don't need to buy a new one. (hooray!)

In the process of figuring this all out, I learned that a Google SoC project "9null" can use Plan 9 to boot Plan 9 directly, rather than 9load which is a significant subset of plan 9 to boot plan 9.

Seems like this should just be easier for the smaller Plan 9 community to maintain. I've not tried 9null out yet, but I'm pretty anxious to give it a go.

This weekend looks like it might be a good time for some of those side projects.