Ruby speed

I'm trying to Marshal a 60GB file (2.07 billion lines), but the thing is going to take about a day to process before I even get to Marshal it (rough calculations). WTF? That's roughly 0.5MB/s.

Both while(line = file.gets) and file.each do |line| are similarly slow. Any alternatives?
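For reference, one alternative to going line by line is to read the file in large chunks and split the lines yourself, which cuts the per-line call overhead. A rough, untested sketch (the 16MB chunk size and file name are made up):

Code:
buffer = ""
File.open("thefile", "rb") do |f|
  while (chunk = f.read(16 * 1024 * 1024))   # read 16MB at a time
    buffer << chunk
    lines = buffer.split("\n", -1)
    buffer = lines.pop                       # keep any partial last line for the next chunk
    lines.each do |line|
      # process line here
    end
  end
end
# if the file doesn't end with a newline, buffer still holds the final line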
 
If you take out the hashing, and just run the loop, how long does it take?

Also, writing to console/stdout will slow things down a lot.
 
Yeah - you're right, the hash is really slowing it down. I can't see an alternative though.

Basically I had a script generate this data and write the results to a file.
Then I had another script read the data and plod on with it. I've now merged the two, Marshalling at the end, so it's doing no disk access now. It just has to read a small file into memory and do some calculations (billions of them!), which takes 10hrs. It looks like the hashing is tripling that, but ah well.

I could speed it up if there was a way to write iteratively to the dump.

At the moment it is doing:

Code:
h = {}

blah.each do
  # calcs
  h[z] = result
end

Marshal.dump(h, dump_file)

If I could bypass the hash, it might speed up.
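For what it's worth, there is a way to write the dump iteratively: Marshal.dump takes an IO as a second argument, so you can append one entry at a time and read them back later with repeated Marshal.load calls until EOFError. A rough, untested sketch reusing the placeholders above (the file name is made up):

Code:
File.open("results.dump", "wb") do |out|
  blah.each do
    # calcs
    Marshal.dump([z, result], out)   # append one [key, value] entry at a time, no big hash in memory
  end
end

# ...and to read it back later:
h = {}
File.open("results.dump", "rb") do |io|
  begin
    loop do
      key, value = Marshal.load(io)
      h[key] = value
    end
  rescue EOFError
    # reached the end of the dump
  end
end

Loading it back still rebuilds the whole hash in memory, of course; it just spares you from holding it while the calculations run.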
 
I don't think any scripting language is really great for raw speed. Why not do it in C?
I'll have a look. To be honest, if loading the serialisation is quick... it really doesn't matter how long it takes to create it. I can just ignore it - assuming I don't run out of memory.
 
Instead of collecting every hash in one big in-memory structure, dump each one immediately. This will prevent you from storing (in memory) an entry for each of the 2-odd billion lines.

I'm not used to ruby syntax, but something like:
Code:
require 'digest/md5'

# Stream the input and write each digest straight out, so nothing accumulates in memory.
File.open("digests.out", "w") do |out|   # output name is just a placeholder
  File.foreach("thefile") do |line|
    digest = Digest::MD5.hexdigest(line)
    out.puts digest
  end
end
 
I'm not entirely sure how one would multi-thread a file read.

File reading isn't the problem. Hashing is. You could do a batch of say 1000 lines, queue up the hashing tasks to a thread pool, wait for them to be processed - then write out the results to the file. Then do the next batch...

It's down to how you want to implement it. Divide-and-conquer is the basic principle.
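Something like this, maybe (untested): the main thread reads and batches the lines onto a queue, a fixed pool of worker threads does the hashing, and a writer thread drains the results. The batch size, thread count and file names are made up, and note that output order isn't preserved across batches.

Code:
require 'thread'
require 'digest/md5'

BATCH_SIZE  = 1000
NUM_WORKERS = 8

work    = SizedQueue.new(NUM_WORKERS * 2)   # bounded, so the reader blocks if the workers fall behind
results = Queue.new

workers = NUM_WORKERS.times.map do
  Thread.new do
    while (batch = work.pop)                # a nil batch means "no more work"
      results << batch.map { |line| Digest::MD5.hexdigest(line) }
    end
  end
end

writer = Thread.new do
  File.open("digests.out", "w") do |out|
    while (digests = results.pop)           # a nil means "all done"
      out.puts digests
    end
  end
end

File.foreach("thefile").each_slice(BATCH_SIZE) { |batch| work << batch }
NUM_WORKERS.times { work << nil }           # shut the workers down
workers.each(&:join)
results << nil                              # then the writer
writer.join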
 
One thread for file reading, many for hashing. Using some sort of event or observer to notify between the threads. Many threads to complete the task of hashing in parallel. I.e. a pool of "hashing" threads.
But it won't gain a performance improvement on a single-core machine...
True, but that's not a reason not to consider it anyway. Maybe I just read it that way, but it sounded like you were warning against multi-threading on a machine that doesn't have multiple cores/CPUs, rather than suggesting to take advantage of it where possible.
 

It sounds like what he is writing is some sort of batch job script. In that sort of scenario you can afford to take the expected hardware into account. So if it's multi-core, write the script to be multi-threaded; if it's single-core, don't bother, as it's just a waste of your time.
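One caveat worth flagging: on standard (MRI) Ruby, threads don't run Ruby code on more than one core at a time, so for CPU-bound work like the hashing, separate worker processes are one way to actually load all the cores. A very rough sketch, assuming the input has already been split into eight chunk files (all names made up):

Code:
require 'digest/md5'

chunks = (0...8).map { |i| "chunk_#{i}.txt" }

pids = chunks.map do |chunk|
  fork do
    # each child process hashes its own chunk and writes its own output file
    File.open("#{chunk}.digests", "w") do |out|
      File.foreach(chunk) { |line| out.puts Digest::MD5.hexdigest(line) }
    end
  end
end

pids.each { |pid| Process.wait(pid) }   # wait for all the workers to finish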
 
Indeed. 8x 3.2GHz cores, 16GB RAM.

One thread for file reading, many for hashing. Using some sort of event or observer to notify between the threads. Many threads to complete the task of hashing in parallel. I.e. a pool of "hashing" threads.
Wooosh. Over my head. I have reading to do!
 
That will be awesome to watch once you've got it working. Please post before and after benchmarks for our amusement.
 