Monday, December 19, 2011

Reading a big file quickly by skipping lines

Suppose we have a big grid file with whitespace-separated columns, such as
a1 b1 c1
a2 b2 c2
a3 b3 c3
a4 b4 c4
a5 b5 c5
a6 b6 c6
a7 b7 c7
a8 b8 c8
a9 b9 c9
a10 b10 c10


Say we want to read 'approximately' every 4th line, i.e. line 2, then line 6, then line 10. Instead of reading every line, we can jump ahead using the file offset. A jump will usually land in the middle of a line, and we only want 'whole' lines, so after each jump we read one line and discard it, then read the next line and keep it. This assumes lines are separated by a newline.


read line and discard //a1 b1 c1
read line and save //a2 b2 c2
jump 18 bytes ahead //about 2 lines; a line is approx 9 bytes including the newline
read line and discard //a5 b5 c5
read line and save //a6 b6 c6
jump 18 bytes ahead
read line and discard //a9 b9 c9
read line and save //a10 b10 c10
...
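
To make the discard/save/jump steps concrete, here is a minimal sketch using only the standard library (seekg/tellg instead of Boost). The function name sample_lines and its jump_bytes parameter are made up for this illustration. Note that each discard/save pair already consumes two lines, so a stride of four lines needs a jump of roughly two lines, which is 18 bytes for the small grid above.

```cpp
#include <cassert>
#include <istream>
#include <sstream>   // std::istringstream, handy for trying this in memory
#include <string>
#include <vector>

// Hypothetical helper (not from the original code): keeps roughly every
// n-th line by discarding one line, saving the next, then seeking ahead.
std::vector<std::string> sample_lines(std::istream& in, std::streamoff jump_bytes) {
    assert(jump_bytes >= 0);
    std::vector<std::string> saved;

    // Remember where the stream ends so we can tell when a jump would overshoot.
    in.seekg(0, std::ios::end);
    const std::streamoff end = in.tellg();
    in.seekg(0, std::ios::beg);

    std::string line;
    while (true) {
        // First read: after a jump this is probably a partial line, so drop it.
        if (!std::getline(in, line)) break;
        // Second read: starts right after a newline, so it is a whole line.
        if (!std::getline(in, line)) break;
        saved.push_back(line);

        // tellg() returns -1 if the stream is no longer readable (e.g. at EOF).
        const std::streamoff here = in.tellg();
        if (here < 0 || here + jump_bytes >= end) break;
        in.seekg(jump_bytes, std::ios::cur);
    }
    return saved;
}
```

Called on a std::istringstream holding the ten-line grid above with jump_bytes = 18, this keeps a2 b2 c2, a6 b6 c6 and a10 b10 c10. On a real file the lines will not all be the same length, so the stride is only approximate.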


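     #include <iostream>
     #include <string>
     #include <boost/iostreams/positioning.hpp> // stream_offset, position_to_offset
     #include <boost/iostreams/seek.hpp>        // seek

     using namespace std;
     using boost::iostreams::position_to_offset;
     using boost::iostreams::seek;
     using boost::iostreams::stream_offset;

     // 'in' is the input stream, already opened on the grid file

     stream_offset off;

     // save the EOF offset so we know when we have finished reading the file
     stream_offset endoff = position_to_offset(seek(in, 0, BOOST_IOS::end));
     // the line above moved us to EOF, so go back to the beginning of the file
     seek(in, 0, BOOST_IOS::beg);
     string line2;
     while( true ) {
         // first read: we have probably landed mid-line, so ignore this "incomplete" line
         if( !getline(in, line2) ) {
             break;
         }

         // second read: we get a "complete" line
         if( !getline(in, line2) ) {
             break;
         }
         //cout << line2 << endl;

         // jump 200000 bytes ahead (approximately 1000 lines, assuming about 200 bytes per line)
         off = position_to_offset(seek(in, 200000, BOOST_IOS::cur));

         // stop if the seek failed or we have reached or passed the end of the file
         if( off < 0 || off >= endoff ) {
             break;
         }
     }

     cout << "finished" << endl;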
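
The 200000-byte jump above only corresponds to about 1000 lines if a line averages about 200 bytes. If you don't know the line length up front, one option is to measure it on the first few lines. The sketch below is an illustration under that assumption; average_line_bytes and jump_bytes_for are hypothetical names, and it uses plain std::istream rather than Boost.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>   // std::istringstream, handy for trying this in memory
#include <string>

// Hypothetical helper: average bytes per line (newline included), measured
// over at most sample_count lines from the current position. Rewinds the
// stream to the beginning afterwards. Returns 0 if no line could be read.
std::streamoff average_line_bytes(std::istream& in, std::size_t sample_count) {
    std::string line;
    std::streamoff total = 0;
    std::size_t seen = 0;
    while (seen < sample_count && std::getline(in, line)) {
        // +1 for the '\n' that getline strips (a slight overestimate if the
        // last line of the file has no trailing newline).
        total += static_cast<std::streamoff>(line.size()) + 1;
        ++seen;
    }
    in.clear();                   // drop eof/fail flags so the seek works
    in.seekg(0, std::ios::beg);   // go back to the start of the file
    return seen == 0 ? 0 : total / static_cast<std::streamoff>(seen);
}

// Hypothetical helper: bytes to jump in order to skip about lines_to_skip lines.
std::streamoff jump_bytes_for(std::streamoff avg_line_bytes, std::streamoff lines_to_skip) {
    assert(avg_line_bytes >= 0 && lines_to_skip >= 0);
    return avg_line_bytes * lines_to_skip;
}
```

For the small grid at the top, the average is 9 bytes per line, so skipping 2 lines means jumping 18 bytes. With 200-byte lines, skipping 1000 lines gives the 200000 used above.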
