Ruby speed

I'm trying to Marshal a 60GB file (2.07 billion lines), but the thing is going to take about a day to process before I Marshal it (rough calculations). WTF? That's 0.5MB/s.

Both while (line = file.gets) and file.each do |line| are similarly slow. Any alternatives?
 
If you take out the hashing, and just run the loop, how long does it take?

Also, writing to console/stdout will slow things down a lot.
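
Something quick like this would tell you (just a sketch; 'data.txt' stands in for your file):

require 'benchmark'

elapsed = Benchmark.realtime do
  File.foreach('data.txt') { |line| }   # bare read loop: no hashing, no printing
end
puts "bare loop: #{elapsed}s"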
Yeah - you're right, the hash is really slowing it down. I can't see an alternative though.

Basically I had one script generate this data and write the results to a file, then another script read the data back and plod on with it.
I've now merged the two, Marshalling at the end, so it does no disk access until the dump. It just has to read a small file into memory and do some calculations (billions of them!), which takes 10hrs. It's looking like the hashing is tripling that, but ah well.

I could speed it up if there were a way to write iteratively to the dump.

At the moment it is doing:

h = {}

lines.each do |line|
  # calcs
  h[z] = result
end

File.open('dump', 'wb') { |f| Marshal.dump(h, f) }

If I could bypass the hash, it might speed up.
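
For what it's worth, Marshal will happily write and read successive objects on one IO, so the "iterative dump" idea could look roughly like this (a sketch; lines, z and result are the placeholders from above):

# write each pair as its own Marshal record instead of one giant hash
File.open('dump', 'wb') do |f|
  lines.each do |line|
    # calcs
    Marshal.dump([z, result], f)
  end
end

# read the records back one at a time
File.open('dump', 'rb') do |f|
  begin
    while true
      key, value = Marshal.load(f)
      # do something with key/value
    end
  rescue EOFError
    # reached the end of the dump
  end
end

That trades the one huge in-memory hash for a stream of small records, at the cost of only being able to read them back sequentially.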
 
I don't think any scripting language is really great for raw speed. Why not do it in C?
I'll have a look. To be honest, if loading the serialisation is quick... it really doesn't matter how long it takes to create it. I can just ignore it - assuming I don't run out of memory.
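
Easy enough to check on a small dump first (sketch; 'dump' is whatever file you wrote):

require 'benchmark'

h = nil
elapsed = Benchmark.realtime do
  File.open('dump', 'rb') { |f| h = Marshal.load(f) }
end
puts "loaded #{h.size} entries in #{elapsed}s"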
 
It sounds like what he is writing is some sort of batch-job script. In these sorts of scenarios you can afford to take the expected hardware into account: if it's multi-core, write the script to be multi-threaded; if it's single-core, don't bother - it's just wasting your time.
Indeed. 8x 3.2GHz cores, 16GB RAM.

One thread for file reading, many for hashing, with some sort of event or observer to notify between the threads - i.e. a pool of "hashing" threads working the task in parallel.
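
Roughly this shape, as a sketch with the stdlib Queue (the worker count and expensive_calc are made up for illustration):

require 'thread'

queue   = Queue.new
results = {}
mutex   = Mutex.new

# the pool of "hashing" threads, consuming lines from the queue
workers = Array.new(4) do
  Thread.new do
    while (line = queue.pop)                 # nil is the shutdown signal
      value = expensive_calc(line)           # stand-in for the real calcs
      mutex.synchronize { results[line] = value }
    end
  end
end

# the single reader thread's job: feed lines into the queue
File.foreach('data.txt') { |line| queue << line }
workers.size.times { queue << nil }          # one stop signal per worker
workers.each(&:join)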
Wooosh. Over my head. I have reading to do!!
 
Personally I would use a thread pool. I'm sure Ruby has one. It saves a lot of boilerplate code/hassle with respect to spawning and managing the lifecycle of the threads.
Just spent a few hours trying to implement a threadpool, and for whatever reason it's ended up slower. Stupid Ruby.

May do it in Java now...

EDIT - seems to be slower on OSX, so I'm blaming 1.8.6 and upgrading now.
 
For the Ruby 1.8 series, at least with the normal reference implementation, IIRC threads don't run in parallel; it basically just has a scheduler in the VM which timeslices - "green threads" in Ruby parlance.
Maybe it's similar to the whole GIL thing in Python.

This is obviously crap from the point of view of parallel programming - I think I read proper native threads arrive in 1.9?
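
A quick CPU-bound test shows it (sketch) - on an interpreter that can't actually run Ruby code in parallel, the threaded run takes about as long as the serial one:

require 'benchmark'

def burn
  x = 0
  5_000_000.times { x += 1 }
end

Benchmark.bm(8) do |b|
  b.report('serial')  { 2.times { burn } }
  b.report('threads') { Array.new(2) { Thread.new { burn } }.each(&:join) }
end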
Indeed sir. Just installed 1.9 and it works better in parallel. Well, it's using multiple threads, but it's up to the OS to distribute them over the CPU cores - which it isn't doing well, being Apple and all (Leopard on the box - can't upgrade).

It runs quicker on my much-lesser-spec Windows box. Ah well.

I have a sample 1GB/35 million line file churning over now... it'll take approx 3hrs, so ~180hrs for the big file. It's still crazily slow. I'm just hoping loading the serialisation dumps is quick, otherwise I may have to re-evaluate.
 
Seriously? I had no idea that OSX was also useless at multi-threading. :confused:
Well, on my laptop (Snow Leopard) it performs *much* better, but it only has two slower cores. The Mac desktop (Leopard) has 8 faster cores, but it can't hold its own when multi-threaded; the Windows desktop (2 slowest cores) fares quite well. Maybe it's a 64-bit thing, me, or Ruby, or whatever, but it's odd.

Either way, I'll just let it tick over...
 
Just in case you wondered, I got it working.

Had to split it into three different steps, across 12 cores, 3 machines, 3 operating systems, and 24GB of memory, but it ran in <15mins. One of the steps is the bottleneck - it creates a massive hash and crashes out. Had to run that bit on Ubuntu with no GUI to free up memory :D.
 
I did some profiling and it is definitely the hash causing the problems. I'm going to have a think about a workaround, maybe combining Marshal dumps or something, to ease the load. But then I'll have the problem of loading it all into memory. Argh.
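
One possible shape for the combining: shard the hash by key, dump each shard to its own file, and merge (or selectively load) on the way back in. A sketch, with pairs and the shard count made up:

SHARDS = 8
shards = Array.new(SHARDS) { {} }

# route each pair to a shard by key, then dump the shards separately
pairs.each { |key, value| shards[key.hash % SHARDS][key] = value }
shards.each_with_index do |shard, i|
  File.open("dump.#{i}", 'wb') { |f| Marshal.dump(shard, f) }
end

# merging everything back into one hash on load
h = {}
Dir['dump.*'].each do |path|
  File.open(path, 'rb') { |f| h.update(Marshal.load(f)) }
end

Since key.hash % SHARDS tells you which file a key lives in, you could also load just the shard you need instead of merging the lot - that's where the memory relief would come from.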
 