Ruby speed

I'm trying to Marshal a 60GB file (2.07 billion lines), but the thing is going to take about a day to process before I Marshal it (rough calculations). WTF? That's 0.5MB/s.

Both while (line = file.gets) and file.each do |line| are similarly slow. Any alternatives?
 
If you take out the hashing, and just run the loop, how long does it take?

Also, writing to console/stdout will slow things down a lot.
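
Something quick like this would tell you (just a sketch; 'data.txt' stands in for your file):

require 'benchmark'

elapsed = Benchmark.realtime do
  File.foreach('data.txt') { |line| }   # bare read loop: no hashing, no printing
end
puts "bare loop: #{elapsed}s"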
Yeah - you're right, the hash is really slowing it down. I can't see an alternative though.

Basically I had one script generate this data and write the results to a file, then another script read the data back and plod on with it.
I've now merged the two, Marshalling at the end, so it does no disk access until the dump. It just has to read a small file into memory and do some calculations (billions of them!), which takes 10hrs. It's looking like the hashing is tripling that, but ah well.

I could speed it up if there were a way to write iteratively to the dump.

At the moment it is doing:

h = {}

lines.each do |line|
  # calcs
  h[z] = result
end

File.open('dump', 'wb') { |f| Marshal.dump(h, f) }

If I could bypass the hash, it might speed up.
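
For what it's worth, Marshal will happily write and read successive objects on one IO, so the "iterative dump" idea could look roughly like this (a sketch; lines, z and result are the placeholders from above):

# write each pair as its own Marshal record instead of one giant hash
File.open('dump', 'wb') do |f|
  lines.each do |line|
    # calcs
    Marshal.dump([z, result], f)
  end
end

# read the records back one at a time
File.open('dump', 'rb') do |f|
  begin
    while true
      key, value = Marshal.load(f)
      # do something with key/value
    end
  rescue EOFError
    # reached the end of the dump
  end
end

That trades the one huge in-memory hash for a stream of small records, at the cost of only being able to read them back sequentially.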
 
I don't think any scripting language is really great for raw speed. Why not do it in C?
I'll have a look. To be honest, if loading the serialisation is quick... it really doesn't matter how long it takes to create it. I can just ignore it - assuming I don't run out of memory.
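
Easy enough to check on a small dump first (sketch; 'dump' is whatever file you wrote):

require 'benchmark'

h = nil
elapsed = Benchmark.realtime do
  File.open('dump', 'rb') { |f| h = Marshal.load(f) }
end
puts "loaded #{h.size} entries in #{elapsed}s"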
 
It sounds like what he is writing is some sort of batch-job script. In these sorts of scenarios you can afford to take the expected hardware into account: if it's multi-core, write the script to be multi-threaded; if it's single-core, don't bother - it's just wasting your time.
Indeed. 8x 3.2GHz cores, 16GB RAM.

One thread for file reading, many for hashing, with some sort of event or observer to notify between the threads - i.e. a pool of "hashing" threads working the task in parallel.
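
Roughly this shape, as a sketch with the stdlib Queue (the worker count and expensive_calc are made up for illustration):

require 'thread'

queue   = Queue.new
results = {}
mutex   = Mutex.new

# the pool of "hashing" threads, consuming lines from the queue
workers = Array.new(4) do
  Thread.new do
    while (line = queue.pop)                 # nil is the shutdown signal
      value = expensive_calc(line)           # stand-in for the real calcs
      mutex.synchronize { results[line] = value }
    end
  end
end

# the single reader thread's job: feed lines into the queue
File.foreach('data.txt') { |line| queue << line }
workers.size.times { queue << nil }          # one stop signal per worker
workers.each(&:join)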
Wooosh. Over my head. I have reading to do!!
 
Personally I would use a thread pool. I'm sure Ruby has one. It saves a lot of boilerplate code/hassle with respect to spawning and managing the lifecycle of the threads.
Just spent a few hours trying to implement a threadpool, and for whatever reason it's ended up slower. Stupid Ruby.

May do it in Java now...

EDIT - seems to be slower on OSX, so I'm blaming 1.8.6 and upgrading now.
 
For the Ruby 1.8 series, at least with the normal reference implementation, IIRC threads don't run in parallel; it basically just has a scheduler in the VM which timeslices - "green threads" in Ruby parlance.
Maybe it's similar to the whole GIL thing in Python.

This is obviously crap from the point of view of parallel programming - I think I read proper native threads arrive in 1.9?
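
A quick CPU-bound test shows it (sketch) - on an interpreter that can't actually run Ruby code in parallel, the threaded run takes about as long as the serial one:

require 'benchmark'

def burn
  x = 0
  5_000_000.times { x += 1 }
end

Benchmark.bm(8) do |b|
  b.report('serial')  { 2.times { burn } }
  b.report('threads') { Array.new(2) { Thread.new { burn } }.each(&:join) }
end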
Indeed sir. Just installed 1.9 and it works better in parallel. Well, it's using multiple threads, but it's up to the OS to distribute them over the CPU cores - which it isn't doing well, being Apple and all (Leopard on the box - can't upgrade).

It runs quicker on my much-lesser-spec Windows box. Ah well.

I have a sample 1GB/35 million line file churning over now... it'll take approx 3hrs, so ~180hrs for the big file. It's still crazily slow. I'm just hoping loading the serialisation dumps is quick, otherwise I may have to re-evaluate.
 
Seriously? I had no idea that OSX was also useless at multi-threading. :confused:
Well, on my laptop (Snow Leopard) it performs *much* better, but it only has two slower cores. The Mac desktop (Leopard) has 8 faster cores, but it can't hold its own when multi-threaded; the Windows desktop (2 slowest cores) fares quite well. Maybe it's a 64-bit thing, me, or Ruby, or whatever, but it's odd.

Either way, I'll just let it tick over...
 
Just in case you wondered, I got it working.

Had to split it into three different steps, across 12 cores, 3 machines, 3 operating systems, and 24GB of memory, but it ran in <15mins. One of the steps is the bottleneck - it creates a massive hash and crashes out. Had to run that bit on Ubuntu with no GUI to free up memory :D.
 
I did some profiling and it is definitely the hash causing the problems. I'm going to have a think about a workaround, maybe combining Marshal dumps or something, to ease the load. But then I'll have the problem of loading it all into memory. Argh.
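
One possible shape for the combining: shard the hash by key, dump each shard to its own file, and merge (or selectively load) on the way back in. A sketch, with pairs and the shard count made up:

SHARDS = 8
shards = Array.new(SHARDS) { {} }

# route each pair to a shard by key, then dump the shards separately
pairs.each { |key, value| shards[key.hash % SHARDS][key] = value }
shards.each_with_index do |shard, i|
  File.open("dump.#{i}", 'wb') { |f| Marshal.dump(shard, f) }
end

# merging everything back into one hash on load
h = {}
Dir['dump.*'].each do |path|
  File.open(path, 'rb') { |f| h.update(Marshal.load(f)) }
end

Since key.hash % SHARDS tells you which file a key lives in, you could also load just the shard you need instead of merging the lot - that's where the memory relief would come from.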
 