Tuesday, June 1, 2010

OMG C++?

I had an interesting problem to solve involving some code that was essentially driven this way:

while (getline(cin, line)) {
    // process line
}


A coworker of mine suggested that this would only process a line at a time, which it will. What I wondered was whether it also only reads a line at a time from the underlying source, or whether cin pulls in larger chunks and getline just carves lines out of them. In essence, I wondered whether the input was buffered for cin.

On the platform I'm testing with, Mac OS X Snow Leopard, it appears that no buffering is going on.

Here's some code to show what I mean:


void show_stats () {
    if (!cin) {
        cout << "Stream is broken or closed\n" << endl;
    }
    else {
        cout << "Available bytes buffered: " << cin.rdbuf()->in_avail() << endl;
    }
}



This asks cin's underlying streambuf implementation how many bytes are available in its buffer. When there are no bytes in the buffer, the istream calls the internal streambuf's "underflow" function to go get more data and to reset the buffer pointers, keeping some number of already-read "putback" bytes in front of the read position.
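To make the relationship between in_avail() and underflow concrete, here is a small sketch. The class array_inbuf and the function trace_in_avail are made-up names for illustration. The buffer is just a fixed character array with no underflow override, so once the get area is used up, in_avail() falls back to showmanyc(), which returns 0 by default, and the next read hits the base class underflow and gets EOF.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <streambuf>
#include <string>
#include <vector>

// Hypothetical read-only streambuf over a caller-owned character array.
class array_inbuf : public std::streambuf {
public:
    array_inbuf(char* data, std::size_t n) {
        // begin of get area, read position, end of get area
        setg(data, data, data + n);
    }
};

// Record in_avail() before reading and after every character consumed.
std::vector<std::streamsize> trace_in_avail(std::string text) {
    assert(!text.empty());
    array_inbuf buf(&text[0], text.size());
    std::istream in(&buf);

    std::vector<std::streamsize> seen;
    seen.push_back(buf.in_avail());
    char c;
    while (in.get(c)) {
        seen.push_back(buf.in_avail());
    }
    return seen;
}
```

Calling trace_in_avail("abc") should record one value per step as the get area drains, ending at 0 just before the stream reports end of input.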

What I found was that at no point was I seeing any buffered input coming in for cin, so I decided to write my own streambuf and subsequent istream classes to deal with both buffering and any file descriptor (unix pipe, socket, file etc).
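One possible explanation for the zero readings, offered as an assumption rather than something verified on that Snow Leopard setup: while cin is synchronized with C stdio, which is the default, some standard library implementations read through stdio without keeping a buffer of their own in the streambuf, so in_avail() has nothing to report. Calling std::ios::sync_with_stdio(false) before any I/O allows the library to give cin its own buffer. The helper names below (buffered_after_peek, probe_unsynced_cin, probe_string) are made up for this sketch.

```cpp
#include <cassert>
#include <ios>
#include <iostream>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical probe: force one fill attempt, then report what is buffered.
std::streamsize buffered_after_peek(std::istream& in) {
    assert(in.rdbuf() != 0);
    in.peek();  // triggers underflow if the get area is empty
    return in ? in.rdbuf()->in_avail() : 0;
}

// Hypothetical wrapper for cin. Call it before any other use of cin or
// stdin; the effect of sync_with_stdio after I/O is implementation-defined.
std::streamsize probe_unsynced_cin() {
    std::ios::sync_with_stdio(false);
    return buffered_after_peek(std::cin);
}

// Baseline for comparison: an in-memory stream holds everything up front.
std::streamsize probe_string(const std::string& text) {
    std::istringstream in(text);
    return buffered_after_peek(in);
}
```

Whether probe_unsynced_cin() reports anything above 0 depends on the library and on what is attached to stdin. The custom streambuf that follows does not depend on this either way.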


#include <cstdio>
#include <cstring>
#include <streambuf>
#include <unistd.h>
#include <iostream>
#include <errno.h>

class fd_inbuf_buffered : public std::streambuf
{
protected:
    int fd;
    const int bSize;
    char * buffer;

public:
    // The first 4 bytes are reserved for putback, so sizes of 4 or less are
    // bumped to 5 to leave at least one byte to read into.
    fd_inbuf_buffered (int _fd, int _bSize=10) : fd(_fd), bSize(_bSize > 4 ? _bSize : 5)
    {
        buffer = new char [bSize];
        // The get pointer should not be at the beginning of the buffer, because
        // it limits the ability to do put back into the input stream should
        // there be a need to. Ideally that situation does not come up, but we
        // leave room for 4 bytes, by pointing all 3 locations to 4 beyond the
        // beginning of the buffer.
        // 4 was the size used in an implementation in Josuttis' "The C++ Standard Library"
        setg( buffer + 4,   // beginning of putback area
              buffer + 4,   // read position
              buffer + 4);  // end position
    }

    ~fd_inbuf_buffered ()
    {
        delete [] buffer;
    }

private:
    // Not copyable: a copy would delete the same buffer twice.
    fd_inbuf_buffered (const fd_inbuf_buffered &);
    fd_inbuf_buffered & operator= (const fd_inbuf_buffered &);

protected:
    // Underflow is what fills our buffer from the fd.
    // If we don't override this, we get the parent, which just returns EOF.
    virtual int_type underflow ()
    {
        // read position before end of buffer
        if (gptr() < egptr())
        {
            // to_int_type keeps a char such as 0xFF from being mistaken for EOF
            return traits_type::to_int_type(*gptr());
        }

        int numPutback = gptr() - eback();

        // must limit the number of characters previously read into the putback
        // buffer... 4 maximum
        if (numPutback > 4)
        {
            numPutback = 4;
        }
        // Copy up to the putback buffer size characters back into the putback
        // area of our buffer. The regions can overlap after a short read, so
        // use memmove rather than memcpy.
        std::memmove (buffer + (4 - numPutback), gptr() - numPutback, numPutback);

        // read new characters
        int num;
    retry:
        num = read(fd, buffer + 4, bSize - 4);
        if (num == 0)
        {
            return traits_type::eof();
        }
        else if (num == -1) {
            switch (errno) {
            case EAGAIN:    // note: this spins on a non-blocking descriptor
            case EINTR:
                goto retry;
            default:
                // any other error ends the stream
                return traits_type::eof();
            }
        }

        // reset buffer pointers
        setg(buffer + (4 - numPutback), buffer + 4, buffer + 4 + num);

        return traits_type::to_int_type(*gptr());
    }
};


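struct fd_istream : public std::istream
{
protected:
    fd_inbuf_buffered buf;
public:
    // The std::istream base is constructed before buf, so start it with a
    // null buffer and attach buf once it exists.
    explicit fd_istream (int fd, int bufsz) : std::istream(0), buf(fd, bufsz)
    {
        rdbuf(&buf);
    }
};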



Now I can declare an istream like so:


fd_istream my_cin(0, 1000);



Where 0 is the numeric file descriptor for stdin and 1000 is the buffer size in bytes.

Because I went with the standard IOStream library, as opposed to just writing C style IO directly, I can use it in the same way I'd use any istream. I can use it with iterators or algorithms from the standard library, and I can even use it with getline as you can see below.
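As a sketch of the iterator case, the following collects whitespace-separated words from a file descriptor with std::istream_iterator. To keep the block self-contained it uses a stripped-down stand-in named simple_fd_inbuf (a made-up class with a fixed 256-byte buffer and no putback area) rather than the class above, and words_from_fd is likewise a hypothetical helper.

```cpp
#include <cassert>
#include <cerrno>
#include <istream>
#include <iterator>
#include <streambuf>
#include <string>
#include <vector>
#include <unistd.h>

// Hypothetical, stripped-down descriptor buffer: fixed size, no putback area.
class simple_fd_inbuf : public std::streambuf {
    int fd_;
    char buf_[256];
public:
    explicit simple_fd_inbuf(int fd) : fd_(fd) {
        setg(buf_, buf_, buf_);  // start empty so the first read calls underflow
    }
protected:
    int_type underflow() {
        if (gptr() < egptr())
            return traits_type::to_int_type(*gptr());
        ssize_t n;
        do {
            n = ::read(fd_, buf_, sizeof buf_);
        } while (n == -1 && errno == EINTR);  // retry interrupted reads
        if (n <= 0)
            return traits_type::eof();        // end of file or hard error
        setg(buf_, buf_, buf_ + n);
        return traits_type::to_int_type(*gptr());
    }
};

// Collect whitespace-separated words from a descriptor via stream iterators.
std::vector<std::string> words_from_fd(int fd) {
    assert(fd >= 0);
    simple_fd_inbuf buf(fd);
    std::istream in(&buf);
    std::istream_iterator<std::string> first(in), last;
    return std::vector<std::string>(first, last);
}
```

Because the iterators only see a std::istream, the same function body would work unchanged with any other streambuf behind it.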


#include <string>
using namespace std;

void show_stats () {
    if (!my_cin) {
        cout << "Stream is broken or closed\n" << endl;
    }
    else {
        cout << "Available bytes buffered: " << my_cin.rdbuf()->in_avail() << endl;
    }
}

int main () {
    string line;
    while (getline(my_cin, line)) {
        cout << line << endl;
        show_stats();
    }
    show_stats();
}



In an example run, such as "cat /usr/share/dict/words | ./a.out" I see something like the following:


Available bytes buffered: 12
Pinacoceras
Available bytes buffered: 0
Pinacoceratidae
Available bytes buffered: 980
pinacocytal
Available bytes buffered: 968
pinacocyte
Available bytes buffered: 957
pinacoid
Available bytes buffered: 948
pinacoidal
Available bytes buffered: 937
pinacol
Available bytes buffered: 929
pinacolate
Available bytes buffered: 918



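This shows the buffer filling up each time underflow reads a chunk of bytes, then draining back down to 0 as getline consumes it. With a 1000-byte buffer and 4 bytes reserved for putback, a full read brings in 996 bytes; after the 16 bytes of "Pinacoceratidae" plus its newline are consumed, 980 remain. At 0 the stream calls underflow again, which either gets more data if it is available or, when I hit EOF, returns that, causing the stream to terminate.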

This stream will work for pipes, sockets and files as long as the file descriptor is provided to the constructor. Because the first 4 bytes of the buffer are reserved for putback, the buffer size passed to the constructor has to be larger than 4, or there is no room left to read into. There are possibly better ways to deal with it, but for demonstration purposes, this works nicely.
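The size arithmetic can be sketched as a small helper. The names buffer_layout and plan_buffer are made up for illustration, and the choice to fall back to a single payload byte for undersized requests is an assumption of this sketch, not something the code above necessarily does.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of how a requested buffer size is divided up.
struct buffer_layout {
    std::size_t putback;  // bytes reserved in front of the read position
    std::size_t payload;  // bytes each read() call is allowed to fill
    std::size_t total() const { return putback + payload; }
};

buffer_layout plan_buffer(std::size_t requested, std::size_t putback = 4) {
    buffer_layout layout;
    layout.putback = putback;
    // Guarantee at least one payload byte. Otherwise read() would be asked
    // for zero bytes and its 0 return would be mistaken for end of file.
    layout.payload = requested > putback ? requested - putback : 1;
    assert(layout.total() >= requested);
    return layout;
}
```

For the 1000-byte buffer used earlier, plan_buffer(1000) gives a 996-byte payload, which matches the 980 bytes reported after the 16-byte line in the sample run.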

C++ isn't always so bad after all. It just depends on how it's written.