I took my pipe-based CSP-like prime sieve and, at the prodding of my friend Kevin at Apple, rewrote it so that the blocks themselves carry the "message" and the queues hold the context I'm interested in. This is in contrast to the pipe version, where the fd acted as an event source that fired a block, and the fd (once wrapped as an event source object) held the context.
This meant restructuring the code a bit and capturing the value I would have written to the pipe in the block's closure instead. To do so I opted to write a function like this:
// Runs a block asynchronously on a queue
void run_block (dispatch_queue_t q, int cur_num) {
    dispatch_async(q,
    ^{
        struct context * ctx = (struct context *) dispatch_get_context(q);
        if (cur_num == LAST) {
            // Sentinel: forward it down the chain, then tear this queue down.
            if (ctx->has_spawned == 1) {
                run_block(ctx->neighbor, cur_num);
            }
            dispatch_release(q);
        }
        else if (cur_num == ctx->mynum) {
            // This is the prime this queue was created for; print it.
            printf("%d\n", cur_num);
        }
        else if (cur_num % ctx->mynum != 0) {
            // Not a multiple of our prime; pass it along, spawning the
            // next queue in the chain if it doesn't exist yet.
            if (ctx->has_spawned == 0) {
                ctx->neighbor = create_queue(cur_num);
                ctx->has_spawned = 1;
            }
            run_block(ctx->neighbor, cur_num);
        }
    });
}
(Full source here)
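The context struct and create_queue aren't shown above; reconstructed from the calls in run_block, they look roughly like this (names and details may differ from the full source):

#include <stdlib.h>
#include <dispatch/dispatch.h>

// Reconstructed sketch of the helpers run_block relies on.
struct context {
    int              mynum;        // the prime this queue filters on
    int              has_spawned;  // has this queue created its neighbor yet?
    dispatch_queue_t neighbor;     // next queue in the sieve chain
};

// Create a queue for 'num' and hang a fresh context off of it.
dispatch_queue_t create_queue (int num) {
    struct context * ctx = calloc(1, sizeof(*ctx));
    ctx->mynum = num;
    dispatch_queue_t q = dispatch_queue_create("sieve", NULL);
    dispatch_set_context(q, ctx);
    return q;
}

Note that this sketch leaks the context when the queue is released; dispatch_set_finalizer_f could clean that up, but it's omitted for brevity.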
Someone told me that #define-ing MAX was bad, and I wanted it to be a command line argument anyway, so I made it so. (here)
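A minimal sketch of that change, assuming the bound arrives as argv[1] (hypothetical variable names; the linked source has the real version):

/* Hypothetical: read the sieve bound from the command line instead
   of a #define'd MAX (atoi needs <stdlib.h>). */
int max = (argc > 1) ? atoi(argv[1]) : 9000;
int x;
for (x = 2; x <= max; ++x) {
    run_block(two_q, x);   /* two_q is the first queue in the chain */
}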
Now this version runs pretty nicely, but all that block copying overhead causes a lot of malloc'd memory for const copies of each block's captured context. Seems a waste. My friend Kevin pointed this out to me and said "try the async_f APIs".
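The shape of that API, paraphrased from <dispatch/dispatch.h>, is a queue, a bare context pointer, and a plain function pointer that gets handed the context when it runs:

typedef void (*dispatch_function_t)(void *);

void dispatch_async_f(dispatch_queue_t queue,
                      void * context,
                      dispatch_function_t work);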
These take a function pointer rather than a block. It's easy enough to convert this block to the function pointer form:
void worker (void * fctx) {
    int cur_num = (int) fctx; // bad bad bad, but I don't care.
    dispatch_queue_t cur_q = dispatch_get_current_queue();
    struct context * qctx = (struct context *) dispatch_get_context(cur_q);
    if (cur_num == LAST) {
        if (qctx->has_spawned == 1) {
            dispatch_async_f(qctx->neighbor, (void *)cur_num, worker);
        }
        dispatch_release(cur_q);
    }
    else if (cur_num == qctx->mynum) {
        printf("%d\n", cur_num);
    }
    else if (cur_num % qctx->mynum != 0) {
        if (qctx->has_spawned == 0) {
            qctx->neighbor = create_queue(cur_num);
            qctx->has_spawned = 1;
        }
        dispatch_async_f(qctx->neighbor, (void *)cur_num, worker);
    }
}
Now when I run my for loop it looks like this:
for (x = 2; x <= max; ++x) {
    dispatch_async_f(two_q, (void *)x, worker);
}
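For context, here's a hedged reconstruction of how that loop might sit in main, with the LAST sentinel pushed through afterwards; the real driver and its shutdown logic are in the linked source:

#include <stdlib.h>
#include <dispatch/dispatch.h>

/* Hypothetical driver; the real main() may differ, particularly in
   how the program decides to exit. */
int main (int argc, char * argv[]) {
    int max = (argc > 1) ? atoi(argv[1]) : 9000;
    dispatch_queue_t two_q = create_queue(2);
    int x;
    for (x = 2; x <= max; ++x) {
        dispatch_async_f(two_q, (void *)x, worker);
    }
    /* Send the LAST sentinel so each queue forwards it down the chain
       and releases itself when it arrives. */
    dispatch_async_f(two_q, (void *)LAST, worker);
    dispatch_main();   /* parks this thread; termination handling is
                          left to the full source */
    return 0;
}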
What's really interesting is comparing the performance of the pipe version, the block-copy version, and the function-pointer version.
Sieving all the primes from 2 to 9000 with the file descriptor version:
1.68s user 5.05s system 171% cpu 3.929 total
With block copy:
0.24s user 0.02s system 118% cpu 0.219 total
With function pointers:
0.06s user 0.02s system 78% cpu 0.098 total
It's hard to do a lot of sieving with the file descriptor version, because you have to up your ulimit, and it hits the kernel a lot (see the system time of 5.05s vs 1.68s user time!). To demonstrate how much faster the function pointer version is compared to the block version, we need a larger range.
Here are the results for 2-90000:
With block copy:
14.52s user 0.21s system 161% cpu 9.119 total
With function pointers:
2.66s user 0.15s system 130% cpu 2.156 total
Look at that speedup! Note how very little time is spent in the kernel making system calls.
Not too shabby.
One thing I did find out is that dispatch_async on a queue that has not had its target queue set to one of the global queues will create a TON of threads. I had up to 179 at one point. Once I added the following line to the queue creation function I wrote, it stays stable at 4 threads:
dispatch_set_target_queue(rv, dispatch_get_global_queue(0,0));
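In context, that line goes into the queue-creation function right after the queue is made; dropped into the sketch from earlier (still a reconstruction, not the exact source) it looks like this:

dispatch_queue_t create_queue (int num) {
    struct context * ctx = calloc(1, sizeof(*ctx));
    ctx->mynum = num;
    dispatch_queue_t rv = dispatch_queue_create("sieve", NULL);
    dispatch_set_context(rv, ctx);
    // Re-target the new queue at the default-priority global queue so
    // libdispatch reuses its worker threads instead of growing new ones.
    dispatch_set_target_queue(rv, dispatch_get_global_queue(0, 0));
    return rv;
}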
This was again advice given to me by my friend Kevin at Apple.
Thanks Kevin, this has been enlightening!