Random file access

Further to a previous thread, an interesting challenge :)

What's the quickest way to read a random line from a 1 GB text file? The files are static, so I can cache the number of lines. Obviously I don't want to load the entire file into memory. My best so far is 15s in Perl:

Code:
perl -e 'srand; rand($.) < 1 && ($n = $_) while <>; print $n' FILE

My ruby version was 30s+ :(.

Something I could call within Ruby would be nice, e.g. Ruby itself, bash, perl, python, java etc. I'll be running this across 45 files, millions of times, so speed is essential.
 
Nothing stopping you from running multiple threads.

It may be easier to build a hash index at the start.
A hash would be nasty, as it's 45 GB of data.

I will be multithreading the actual execution, but this one process is a serious bottleneck.
 
Do the lines have a fixed length (in bytes)? I.e. do you know where the newline characters are without having to scan for them?
Sadly not.

Sed is winning, <5sec average:

Code:
sed -n '#{r+1}q;#{r}p' #{file}
Where r is a random number from 1 to the number of lines in the file. I ****** love unix at times.
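
If that command is being run from Ruby, a minimal sketch of a wrapper might look like the one below. The helper name sed_line is made up for illustration. It passes the arguments to sed as an array, so no shell is involved and the `;` in the script needs no quoting. It assumes sed is on the PATH and that r is 1-based.

```ruby
# Hypothetical helper (not from the thread): return line r (1-based) of the
# file at path by shelling out to sed, or nil if the file has fewer lines.
def sed_line(path, r)
  r = Integer(r)
  # Same script as above: print line r, then quit on reaching line r + 1
  # so the rest of the file is never read.
  script = "#{r + 1}q;#{r}p"
  # Array form of IO.popen runs sed directly, without going through a shell.
  out = IO.popen(["sed", "-n", script, path], &:read)
  out.empty? ? nil : out
end
```

With the line count cached, picking a random line is then something like `sed_line(file, rand(1..line_count))`. Note that sed still has to read every line up to r, so the cost grows with how far into the file the chosen line sits.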
 
So if you have 45 GB and build a dense index of the line start positions, the first pass would be slower, but after that a random selection from the index and the fetch itself would on average be quicker than 5 seconds.
Multiply that saving by a large number of lookups and what appears quicker initially isn't.
Of course, but where do I store 45 GB of data in active memory? This is the problem I had in the other thread.
 
Also, with your current sed mechanism you're saturating the transfer buses and CPU with data that you're not using. In essence, you're copying data you don't need. Building a data index requires one initial pass over the data. From that point on, each access is just a lookup in the index followed by a seek, and the machine only spends its time reading data directly related to the line you actually want.

Sorry, but coming from a supercomputing background, full-file scanning with sed etc. pushes my ignition button!

You may have noticed my use of 64-bit file pointers. The reason for making that distinction is that you should check that your C compiler's or language's file offset type isn't 32-bit: old-school C used a signed 32-bit offset, which tops out at 2 GB, while newer platforms use 64-bit offsets.
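
For anyone wanting to try the index idea from Ruby, here's a minimal sketch. It isn't the code referred to above, just one way it could look. The helper names build_line_index and read_line are made up, and storing the index in a side file on disk (rather than in memory) is an assumption. Each line start is stored as a fixed-width 64-bit offset, so entry n sits at byte n * 8 of the index file.

```ruby
# Sketch only: build_line_index and read_line are hypothetical helper names.

OFFSET_BYTES = 8 # each offset is packed as an unsigned 64-bit little-endian integer

# One sequential pass over the data file, writing the byte offset at which
# each line starts to a side file on disk. Returns the number of lines.
def build_line_index(data_path, index_path)
  count = 0
  File.open(data_path, "rb") do |data|
    File.open(index_path, "wb") do |index|
      until data.eof?
        index.write([data.pos].pack("Q<")) # offset of the line about to be read
        data.gets                          # advance to the start of the next line
        count += 1
      end
    end
  end
  count
end

# Fetch line n (0-based): one seek in the index file, one seek in the data file.
def read_line(data_path, index_path, n)
  offset = File.open(index_path, "rb") do |index|
    index.seek(n * OFFSET_BYTES)
    index.read(OFFSET_BYTES).unpack1("Q<")
  end
  File.open(data_path, "rb") do |data|
    data.seek(offset)
    data.gets
  end
end
```

Usage would be along the lines of `count = build_line_index(file, idx)` once per file, then `read_line(file, idx, rand(count))` for each lookup. The index costs 8 bytes per line and lives on disk, so it doesn't need to fit in memory. For millions of lookups you'd probably want to keep both file handles open rather than reopening them on every call.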
Thanks for your posts! Will look into it tomorrow :)