Random file access

Further to a previous thread, an interesting challenge :)

What's the quickest way to read a random line from a 1GB text file? (The files are static, so the line count can be cached.) Obviously I don't want to load the entire file into memory. I've got 15s in Perl:

Code:
perl -e 'srand; rand($.) < 1 && ($n = $_) while <>; print $n' FILE

My ruby version was 30s+ :(.

Something I could call within Ruby would be nice, e.g. Ruby itself, bash, perl, python, java etc. I'll be running this across 45 files, millions of times, so speed is essential.
 
Nothing stopping you from running multiple threads..

It may be easier to build a hash index at the start..
Hash would be nasty as it's 45GB of data.

I will be multithreading the actual execution, but this one process is a serious bottleneck.
 
Do the lines have a fixed length (in bytes)? i.e. do you know where the newline characters are without having to scan for them?
 
Do the lines have a fixed length (in bytes)? i.e. do you know where the newline characters are without having to scan for them?
Sadly not.

Sed is winning, <5sec average:

Code:
sed -n '#{r+1}q;#{r}p' #{file}
Where r is a random line number from 1 to the cached line count of the file. I ****** love unix at times.
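
For reference, called from Ruby it ends up as something along these lines (num_lines being the cached line count; the exact wiring may differ):

Code:
# Shelling out to sed; num_lines is the cached line count for the file
r = rand(num_lines) + 1
line = `sed -n '#{r + 1}q;#{r}p' #{file}`.chomp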
 
Hash would be nasty as it's 45GB of data.

I will be multithreading the actual execution, but this one process is a serious bottleneck.

So if you have 45GB and build a dense index of the line start positions, the first parse will be slower, but after that a random selection from the index plus the fetch will be quicker than your 5 second average.
Multiply that saving by a large number of lookups and what appears quicker initially isn't.

Also note that as the file is on a single drive, multithreading will cause you an issue as the drive will be spamming head moves (assuming the file isn't located on an enterprise SAN with its own cache).
If you don't need each 'random' line the moment it's picked, you can select the random line numbers first, then sequence the fetches into single passes that seek through the file sequentially.

Also 'average' and random timings aren't really comparable unless you're starting on the same seed.
 
So if you have 45GB and build a dense index of the line start positions, the first parse will be slower, but after that a random selection from the index plus the fetch will be quicker than your 5 second average.
Multiply that saving by a large number of lookups and what appears quicker initially isn't.
Of course, but where do I store 45GB of data in active memory? This is the problem I had in the other thread.
 
Of course, but where do I store 45GB of data in active memory? This is the problem I had in the other thread.

It's not going to be 45GB.

1. Parse the 45GB file for lines - at each line start, write out its byte offset to a separate index file/array. The idea is to build a file/array with constant-sized elements so an entry can be looked up directly from its line number. As each line is represented by just a number rather than the entire line, the space required to store the index is considerably less.

Code:
off_t fileIndex[MAXLINES];                 /* 64-bit offsets of each line start */
long currentLine = 0;
int c;
fileIndex[currentLine++] = 0;              /* first line starts at offset 0 */
while ((c = fgetc(fp)) != EOF && currentLine < MAXLINES) {
    if (c == '\n')                         /* next line starts just after the newline */
        fileIndex[currentLine++] = ftello(fp);
}
long totalNumberOfLines = currentLine;

Assuming that the array will fit in memory.. if it doesn't then it can be stored as a file but it's best if that file is stored on a different drive to the 45GB data file.
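
If you're driving this from Ruby, step 1 might look roughly like this (a sketch only - the file names are placeholders, and each offset is written as a fixed-width 64-bit value so line n sits at byte n*8 of the index):

Code:
# Step 1 sketch: write the byte offset of each line start to an index file.
# "data.txt" and "data.idx" are placeholder names.
offset = 0
File.open("data.idx", "wb") do |idx|
  File.open("data.txt", "rb") do |f|
    f.each_line do |line|
      idx.write([offset].pack("Q"))   # unsigned 64-bit, native byte order
      offset += line.bytesize
    end
  end
end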

2. When you want to access the line number of your choice:
line number = random 0 .. totalNumberOfLines-1
read line = READ from the 45GB file, the line starting at the file offset fileIndex[line number]

In pseudo code but you can see what I mean.
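
In Ruby, that lookup could be as little as a couple of seeks - a rough sketch, using the same placeholder file names as the step 1 sketch above:

Code:
# Step 2 sketch: pick a random line number, look up its offset, seek and read.
total_lines = File.size("data.idx") / 8
n = rand(total_lines)
offset = File.open("data.idx", "rb") do |idx|
  idx.seek(n * 8)
  idx.read(8).unpack("Q").first
end
line = File.open("data.txt", "rb") do |f|
  f.seek(offset)
  f.readline
end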

The sequenced access planning is a bit harder, but it just means building a list of requests (say 100 deep) then sorting that list before reading the 100 lines.. note - you read in sequential order but the data requested is still random (ie only the access order is sequential, not the returned results).
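
A rough sketch of that batching (read_line_at is just a made-up name standing in for the seek-and-read lookup above):

Code:
# Pick 100 random line numbers, sort them, then fetch them in one forward sweep.
wanted = Array.new(100) { rand(total_lines) }.sort
lines  = wanted.map { |n| read_line_at(n) }   # read_line_at: hypothetical helper wrapping the lookup above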
 
Also with your current sed mechanism, you're saturating the transfer buses/CPU with data you're not using. In essence you're copying data you don't need to. Building a data index requires one initial pass over the data, but from that point the accesses are just redirection lookups and the machine only spends its time reading data directly related to the line you actually want.

Sorry, but coming from a supercomputing background, full file scanning using sed etc pushes my ignition button!

You may have noticed my use of 64-bit file pointers. The reason for making that distinction is that you should check your C compiler's or language's file pointer isn't 32-bit (old-school C used a signed 32-bit offset; modern file pointers are 64-bit).
 
Also with your current sed mechanism, you're saturating the transfer buses/CPU with data you're not using. In essence you're copying data you don't need to. Building a data index requires one initial pass over the data, but from that point the accesses are just redirection lookups and the machine only spends its time reading data directly related to the line you actually want.

Sorry, but coming from a supercomputing background, full file scanning using sed etc pushes my ignition button!

You may have noticed my use of 64-bit file pointers. The reason for making that distinction is that you should check your C compiler's or language's file pointer isn't 32-bit (old-school C used a signed 32-bit offset; modern file pointers are 64-bit).
Thanks for your posts! Will look into it tomorrow :)
 
Hello Everybody.
Random access files consist of records that can be accessed in any sequence. This means the data is stored exactly as it appears in memory, saving processing time both when the file is written and when it is read.
 
Hello Everybody.
Random access files consist of records that can be accessed in any sequence. This means the data is stored exactly as it appears in memory, saving processing time both when the file is written and when it is read.

Random accessibility for files has nothing to do with RAM. I think you're thinking of memory-mapped files, for which you'd need a 64-bit address space due to the size of the file. It also doesn't address the access time, as you'd still be scanning the file without an index. As it's 45GB, caching it in 4GB of RAM wouldn't give much of a speed increase. By indexing and caching the smaller index, the system would actually benefit from the 4GB of RAM...
 