Random file access

Further to a previous thread, an interesting challenge :)

What's the quickest way to read a random line from a 1GB text file? (The files are static, so the line count can be cached.) Obviously I don't want to load the entire file into memory. I've got 15s in Perl:

Code:
perl -e 'srand; rand($.) < 1 && ($n = $_) while <>; print $n' FILE

My ruby version was 30s+ :(.

Something I could call within Ruby would be nice, e.g. Ruby itself, bash, perl, python, java etc. I'll be running this across 45 files, millions of times, so speed is essential.
 
Nothing stopping you from running multiple threads..

It may be easier to build a hash index at the start..
Hash would be nasty as it's 45GB of data.

I will be multithreading the actual execution, but this one process is a serious bottleneck.
 
Do the lines have a fixed length (in bytes)? i.e. do you know where the newline characters are without having to scan for them?
 
Do the lines have a fixed length (in bytes)? i.e. do you know where the newline characters are without having to scan for them?
Sadly not.

Sed is winning, <5sec average:

Code:
sed -n '#{r+1}q;#{r}p' #{file}
Where r is a random line number from 1 to the cached line count of the file. I ****** love unix at times.
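
For reference, called from Ruby it ends up as something along these lines (num_lines being the cached line count; the exact wiring may differ):

Code:
# Shelling out to sed; num_lines is the cached line count for the file
r = rand(num_lines) + 1
line = `sed -n '#{r + 1}q;#{r}p' #{file}`.chomp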
 
Hash would be nasty as it's 45GB of data.

I will be multithreading the actual execution, but this one process is a serious bottleneck.

So if you have 45GB and build a dense index of the line start positions, the first parse will be slower, but after that a random selection from the index plus the fetch will be quicker than your 5 second average.
Multiply that saving by a large number of lookups and what appears quicker initially isn't.

Also note that as the file is on a single drive, multithreading will cause you an issue as the drive will be spamming head moves (assuming the file isn't located on an enterprise SAN with its own cache).
If you don't need each 'random' line the moment it's picked, you can select the random line numbers first, then sequence the fetches into single passes that seek through the file sequentially.

Also 'average' and random timings aren't really comparable unless you're starting on the same seed.
 
So if you have 45GB and build a dense index of the line start positions, the first parse will be slower, but after that a random selection from the index plus the fetch will be quicker than your 5 second average.
Multiply that saving by a large number of lookups and what appears quicker initially isn't.
Of course, but where do I store 45GB of data in active memory? This is the problem I had in the other thread.
 
Of course, but where do I store 45GB of data in active memory? This is the problem I had in the other thread.

It's not going to be 45GB.

1. Parse the 45GB file for lines - at each line start, write out its byte offset to a separate index file/array. The idea is to build a file/array with constant-sized elements so an entry can be looked up directly from its line number. As each line is represented by just a number rather than the entire line, the space required to store the index is considerably less.

Code:
off_t fileIndex[MAXLINES];                 /* 64-bit offsets of each line start */
long currentLine = 0;
int c;
fileIndex[currentLine++] = 0;              /* first line starts at offset 0 */
while ((c = fgetc(fp)) != EOF && currentLine < MAXLINES) {
    if (c == '\n')                         /* next line starts just after the newline */
        fileIndex[currentLine++] = ftello(fp);
}
long totalNumberOfLines = currentLine;

Assuming that the array will fit in memory.. if it doesn't then it can be stored as a file but it's best if that file is stored on a different drive to the 45GB data file.
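
If you're driving this from Ruby, step 1 might look roughly like this (a sketch only - the file names are placeholders, and each offset is written as a fixed-width 64-bit value so line n sits at byte n*8 of the index):

Code:
# Step 1 sketch: write the byte offset of each line start to an index file.
# "data.txt" and "data.idx" are placeholder names.
offset = 0
File.open("data.idx", "wb") do |idx|
  File.open("data.txt", "rb") do |f|
    f.each_line do |line|
      idx.write([offset].pack("Q"))   # unsigned 64-bit, native byte order
      offset += line.bytesize
    end
  end
end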

2. When you want to access the line number of your choice:
line number = random 0 .. totalNumberOfLines-1
read line = READ from the 45GB file, the line starting at the file offset fileIndex[line number]

In pseudo code but you can see what I mean.
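
In Ruby, that lookup could be as little as a couple of seeks - a rough sketch, using the same placeholder file names as the step 1 sketch above:

Code:
# Step 2 sketch: pick a random line number, look up its offset, seek and read.
total_lines = File.size("data.idx") / 8
n = rand(total_lines)
offset = File.open("data.idx", "rb") do |idx|
  idx.seek(n * 8)
  idx.read(8).unpack("Q").first
end
line = File.open("data.txt", "rb") do |f|
  f.seek(offset)
  f.readline
end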

The sequenced access planning is a bit harder, but it just means building a list of requests (say 100 deep) then sorting that list before reading the 100 lines.. note - you read in sequential order but the data requested is still random (ie only the access order is sequential, not the returned results).
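
A rough sketch of that batching (read_line_at is just a made-up name standing in for the seek-and-read lookup above):

Code:
# Pick 100 random line numbers, sort them, then fetch them in one forward sweep.
wanted = Array.new(100) { rand(total_lines) }.sort
lines  = wanted.map { |n| read_line_at(n) }   # read_line_at: hypothetical helper wrapping the lookup above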
 
Also with your current sed mechanism, you're saturating the transfer buses/CPU with data you're not using. In essence you're copying data you don't need to. Building a data index requires one initial pass over the data, but from that point the accesses are just redirection lookups and the machine only spends its time reading data directly related to the line you actually want.

Sorry, but coming from a supercomputing background, full file scanning using sed etc pushes my ignition button!

You may have noticed my use of 64-bit file pointers. The reason for making that distinction is that you should check your C compiler's or language's file pointer isn't 32-bit (old-school C used a signed 32-bit offset; modern file pointers are 64-bit).
 
Also with your current sed mechanism, you're saturating the transfer buses/CPU with data you're not using. In essence you're copying data you don't need to. Building a data index requires one initial pass over the data, but from that point the accesses are just redirection lookups and the machine only spends its time reading data directly related to the line you actually want.

Sorry, but coming from a supercomputing background, full file scanning using sed etc pushes my ignition button!

You may have noticed my use of 64-bit file pointers. The reason for making that distinction is that you should check your C compiler's or language's file pointer isn't 32-bit (old-school C used a signed 32-bit offset; modern file pointers are 64-bit).
Thanks for your posts! Will look into it tomorrow :)
 
Hello Everybody.
Random access files consist of records that can be accessed in any sequence. This means the data is stored exactly as it appears in memory, saving processing time both when the file is written and when it is read.
 
Hello Everybody.
Random access files consist of records that can be accessed in any sequence. This means the data is stored exactly as it appears in memory, saving processing time both when the file is written and when it is read.

Random accessibility for files has nothing to do with RAM. I think you're thinking of memory-mapped files, for which you'd need a 64-bit address space due to the size of the file. It also doesn't address the access time, as you'd still be scanning the file without an index. As it's 45GB, caching it in 4GB of RAM wouldn't give much of a speed increase. By indexing and caching the smaller index, the system would actually benefit from the 4GB of RAM...
 