Ruby speed

I'm trying to Marshal a 60GB file (2.07 billion lines), but the thing is going to take about a day to process before I even get to Marshal it (rough calculations). WTF? That's roughly 0.5MB/s.

Both while(line = file.gets) and file.each do |line| are similarly slow. Any alternatives?
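For reference, one alternative to going line by line is to read the file in large chunks and split the lines yourself, which cuts the per-line call overhead. A rough, untested sketch (the 16MB chunk size and file name are made up):

Code:
buffer = ""
File.open("thefile", "rb") do |f|
  while (chunk = f.read(16 * 1024 * 1024))   # read 16MB at a time
    buffer << chunk
    lines = buffer.split("\n", -1)
    buffer = lines.pop                       # keep any partial last line for the next chunk
    lines.each do |line|
      # process line here
    end
  end
end
# if the file doesn't end with a newline, buffer still holds the final line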
 
If you take out the hashing, and just run the loop, how long does it take?

Also, writing to console/stdout will slow things down a lot.
 
Yeah - you're right, the hash is really slowing it down. I can't see an alternative though.

Basically I had a script generate this data and write the results to a file.
Then I had another script read the data and plod on with it. I've now merged the two, Marshalling at the end, so it's doing no disk access now. It just has to read a small file into memory and do some calculations (billions of them!), which takes 10hrs. It looks like the hashing is tripling that, but ah well.

I could speed it up if there was a way to write iteratively to the dump.

At the moment it is doing:

Code:
h = {}

blah.each do
  # calcs
  h[z] = result
end

Marshal.dump(h, dump_file)

If I could bypass the hash, it might speed up.
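For what it's worth, there is a way to write the dump iteratively: Marshal.dump takes an IO as a second argument, so you can append one entry at a time and read them back later with repeated Marshal.load calls until EOFError. A rough, untested sketch reusing the placeholders above (the file name is made up):

Code:
File.open("results.dump", "wb") do |out|
  blah.each do
    # calcs
    Marshal.dump([z, result], out)   # append one [key, value] entry at a time, no big hash in memory
  end
end

# ...and to read it back later:
h = {}
File.open("results.dump", "rb") do |io|
  begin
    loop do
      key, value = Marshal.load(io)
      h[key] = value
    end
  rescue EOFError
    # reached the end of the dump
  end
end

Loading it back still rebuilds the whole hash in memory, of course; it just spares you from holding it while the calculations run.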
 
I don't think any scripting language is really great for raw speed. Why not do it in C?
I'll have a look. To be honest, if loading the serialisation is quick... it really doesn't matter how long it takes to create it. I can just ignore it - assuming I don't run out of memory.
 
Instead of collecting every hash in one big in-memory structure, dump each one immediately. This will prevent you from storing (in memory) an entry for each of the 2-odd billion lines.

I'm not used to ruby syntax, but something like:
Code:
require 'digest/md5'

# Stream the input and write each digest straight out, so nothing accumulates in memory.
File.open("digests.out", "w") do |out|   # output name is just a placeholder
  File.foreach("thefile") do |line|
    digest = Digest::MD5.hexdigest(line)
    out.puts digest
  end
end
 
I'm not entirely sure how one would multi-thread a file read.

File reading isn't the problem. Hashing is. You could do a batch of say 1000 lines, queue up the hashing tasks to a thread pool, wait for them to be processed - then write out the results to the file. Then do the next batch...

It's down to how you want to implement it. Divide-and-conquer is the basic principle.
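Something like this, maybe (untested): the main thread reads and batches the lines onto a queue, a fixed pool of worker threads does the hashing, and a writer thread drains the results. The batch size, thread count and file names are made up, and note that output order isn't preserved across batches.

Code:
require 'thread'
require 'digest/md5'

BATCH_SIZE  = 1000
NUM_WORKERS = 8

work    = SizedQueue.new(NUM_WORKERS * 2)   # bounded, so the reader blocks if the workers fall behind
results = Queue.new

workers = NUM_WORKERS.times.map do
  Thread.new do
    while (batch = work.pop)                # a nil batch means "no more work"
      results << batch.map { |line| Digest::MD5.hexdigest(line) }
    end
  end
end

writer = Thread.new do
  File.open("digests.out", "w") do |out|
    while (digests = results.pop)           # a nil means "all done"
      out.puts digests
    end
  end
end

File.foreach("thefile").each_slice(BATCH_SIZE) { |batch| work << batch }
NUM_WORKERS.times { work << nil }           # shut the workers down
workers.each(&:join)
results << nil                              # then the writer
writer.join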
 
One thread for file reading, many for hashing. Using some sort of event or observer to notify between the threads. Many threads to complete the task of hashing in parallel. I.e. a pool of "hashing" threads.
But it won't gain a performance improvement on a single-core machine...
True, but that's not a reason not to consider it anyway. Maybe I just read it that way, but it sounded like you were warning against multi-threading on a machine that doesn't have multiple cores/CPUs, rather than suggesting to take advantage of it where possible.
 

It sounds like what he is writing is some sort of batch job script. In that sort of scenario you can afford to take the expected hardware into account. So if it's multi-core, write the script to be multi-threaded; if it's single-core, don't bother, as it's just a waste of your time.
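One caveat worth flagging: on standard (MRI) Ruby, threads don't run Ruby code on more than one core at a time, so for CPU-bound work like the hashing, separate worker processes are one way to actually load all the cores. A very rough sketch, assuming the input has already been split into eight chunk files (all names made up):

Code:
require 'digest/md5'

chunks = (0...8).map { |i| "chunk_#{i}.txt" }

pids = chunks.map do |chunk|
  fork do
    # each child process hashes its own chunk and writes its own output file
    File.open("#{chunk}.digests", "w") do |out|
      File.foreach(chunk) { |line| out.puts Digest::MD5.hexdigest(line) }
    end
  end
end

pids.each { |pid| Process.wait(pid) }   # wait for all the workers to finish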
 
Indeed. 8x 3.2GHz cores, 16GB RAM.

One thread for file reading, many for hashing. Using some sort of event or observer to notify between the threads. Many threads to complete the task of hashing in parallel. I.e. a pool of "hashing" threads.
Wooosh. Over my head. I have reading to do!
 
That will be awesome to watch once you've got it working. Please post before and after benchmarks for our amusement.
 