Cloud server?

I'm building a data mining app at work which looks for associations between stuff stored in a MySQL database. It's a tad on the intensive side and can max out all four cores on my i7 quad core machine for about 2-3 days.

At the moment it's running inside our LAN, but we need to access the data from our web servers so I am wondering whether it might be a good idea to stick it on a cloud server. Maybe one of those EC2 things.

Does anyone have any experience of this? Does this seem a good idea? The main reason for thinking EC2 might be a wise move is that we may sell access to the data mining tool so third parties can utilise it, and we'll need something scalable for that.
 
Heavy CPU work is the exact kind of thing the cloud is useful for, particularly if you might want to run a LOAD of analysis one week then nothing the following week.

Depending on your requirements, you might want to look at whether the calculations could be performed once and the results resold repeatedly, rather than having many separate people needlessly repeat the same work.
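
To illustrate the "calculate once, sell many times" idea, here's a minimal Python sketch. The `WeeklyResults` class, its `compute` callback and the week label are all made up for illustration; in practice the cache would more likely be a table in the database than an in-memory dict.

```python
class WeeklyResults:
    """Run the expensive analysis once per week and serve every later
    request for that week from the stored result (hypothetical helper)."""

    def __init__(self, compute):
        self._compute = compute   # the expensive mining job (assumed callable)
        self._cache = {}          # week label -> stored results
        self.runs = 0             # how many times the job actually ran

    def get(self, week):
        # Only run the job if nobody has asked for this week yet.
        if week not in self._cache:
            self._cache[week] = self._compute(week)
            self.runs += 1
        return self._cache[week]


# Stand-in for the real job: just returns a labelled placeholder.
store = WeeklyResults(lambda week: {"week": week, "rules": []})
first_customer = store.get("week-1")
second_customer = store.get("week-1")   # served from the cache, no re-run
```

However many customers ask for the same week, the job only runs once, so the cost of the cloud instance is tied to how often the data changes rather than to how many people buy access.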
 
Thanks. Sounds ideal, then. Yes, the calculations only need to run once a week. I'm currently optimising the code to reduce the number of queries required and try to improve performance. Amazing how much processing power it can use up.
 
Oh another thing is bandwidth - how much will you need? I'd look, personally, at avoiding the unmetered ones if you can help it - those tend to attract people who know they'll be using a heap of bandwidth, nobbling the server a little.

For me, Rackspace has worked very well. I'm assuming here that you're already adept at installing and configuring a Linux server as you see fit; that's pretty much what you'll get! (Just a root password and an IP address.)
 
The Rackspace 'cloud' is just a standard Xen VDS setup, albeit with flexible billing. You'll likely have less raw CPU power available than on your current setup - any VDS will always be slower than dedicated hardware simply because it's just a piece of a server.

I'd investigate Amazon EC2 to see if it's suitable as it's probably the best cloud system in terms of the sheer scale of what you can do. It's definitely not a non-geek tool though!
 
I'd verify what's holding it up. I've seen very few databases limited by CPU for long, even when doing fairly complex stuff. SSDs are the best database speed-up tool going currently (actually that's not quite true, the best is loading your entire database into RAM). If it really is CPU, though, then it's a toss-up. Unless your code scales *really* well you likely won't get any speed advantage from EC2, but if that doesn't bother you then it gives you somewhere to offload the analytics and forget about them until they're done, which can be nice. It isn't as cheap as all that, though, particularly if you're running it frequently for long periods. It also isn't, as has been said, a tool for novices, even geeky novices; you really need some programming experience to get to grips with it quickly, in my view.
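
One rough way to check what's holding it up, sketched in Python: compare the CPU time your process burns against the wall-clock time a step takes. The `cpu_fraction` helper is a made-up name, and it only counts CPU used by the calling process, so time the MySQL server spends on your queries shows up as waiting, not as CPU.

```python
import time


def cpu_fraction(work):
    """Return (CPU time / wall-clock time) for one call of work().

    Close to 1.0 means the step is mostly computing in this process;
    close to 0.0 means it is mostly waiting (disk, network, the database).
    A multi-threaded step can exceed 1.0.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    work()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu / wall if wall > 0 else 0.0


# Two stand-in workloads: one that waits, one that computes.
waiting = cpu_fraction(lambda: time.sleep(0.2))
computing = cpu_fraction(lambda: sum(range(2_000_000)))
print(f"waiting: {waiting:.2f}  computing: {computing:.2f}")
```

If the real job comes out near zero, more cores on EC2 won't help much and the time is better spent on the queries, indexes or storage.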

The other consideration is security, of course. Amazon have a few options for a secure cloud, with VPNs back home and firewall rules of sorts, but it's not, in my view, as transparent and reassuring as an actual firewall in front of a server in a locked rack if the data is proprietary and/or valuable.

The Rackspace 'cloud' is just a standard Xen VDS setup, albeit with flexible billing. You'll likely have less raw CPU power available than on your current setup - any VDS will always be slower than dedicated hardware simply because it's just a piece of a server.

I'd investigate Amazon EC2 to see if it's suitable as it's probably the best cloud system in terms of the sheer scale of what you can do. It's definitely not a non-geek tool though!

That's exactly what EC2 is as well, Xen with a fancy billing and provisioning frontend. I don't know what else people imagine it to be, all the 'clouds' are just VPS platforms with fancy business models...
 
I would hope a cloud would have some form of shared storage, so that when a host dies your 'cloud' data doesn't vanish.

The Rackspace stuff isn't bad. I've used a few servers on it and they run nicely.
 
(actually that's not quite true, the best is loading your entire database into RAM).

I would imagine that utilising RAM disks effectively is what Amazon would prefer people to do on their virtual instances given the amount of RAM they offer with their EC2 instances.

But as has been mentioned/alluded to already - shared disk I/O on EC2 (and most "cloud" providers) is pretty shocking by all accounts, and the amount of CPU available to some of the smaller instances is also pretty compromised.
 
Thanks for the useful feedback everyone.

@blueacid: Bandwidth usage should be fairly low. We'll just be pulling in recent purchase data and then accessing some of it via a webservice. Shouldn't be too bad.

@bigredshark: Yes, I'm working on optimising it! The performance/speed issue is really just down to the volume of queries being made, I think. It's using an Apriori-type algorithm, which has a tendency to be a tad intensive - might change this to something quicker. I'm adding more tree-pruning steps to it to cut down on the number of queries, which should speed things up a bit.
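
For anyone unfamiliar with it, here's a minimal Python sketch of textbook Apriori with the standard pruning step. It works on in-memory baskets rather than database queries, and the function name, sample baskets and support threshold are made up for illustration, so it isn't the actual code from the app.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset that appears in at
    least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items in one pass over the data.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: n for s, n in counts.items() if n >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Join: build k-item candidates from pairs of frequent (k-1)-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: a candidate can only be frequent if all of its
        # (k-1)-item subsets are frequent, so drop the rest uncounted.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev
                             for s in combinations(c, k - 1))}
        # Count the surviving candidates against the data.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: n for s, n in counts.items() if n >= min_support}
        result.update(frequent)
        k += 1
    return result


# Made-up purchase baskets for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
found = apriori(baskets, min_support=2)
```

With these baskets the three-item candidate is thrown away in the prune step without ever being counted, because milk and butter only appear together once. That is where the saving in queries comes from: every pruned candidate is a count that never has to be run.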

Security wise, it really only contains customer IDs and binary scores, plus product names and prices. Nothing really personally identifiable, or valuable to anyone else. Anything remotely personal tends to be encrypted.
 