Serious Overclocking Question

I am a computational physicist working with numerically very fragile codes. I now have access to clusters etc. to run my work (some simulations require upwards of 300,000 CPU hours to complete), but I was wondering whether overclocking a processor would change the outcome of these simulations, or whether CPU design is now at a stage where it wouldn't make any difference.

As an aside, I have spoken to people who think that increasing the number of cores on the processor will not have any tangible effect on the speed-up of some of the quantum codes I use, as the cache on the die will be too small for the instructions and so will hamper the speed of the program. Are there any computational scientists who know about all this, and whether these people are correct?

Don't worry, before anyone asks: I am not the cluster manager, and I won't be overclocking it :P
 
I suppose the closest most of us will have come to this is using programs like Folding@home etc., where unstable overclocks can affect the result of a work unit.
 
All computers work through components called gates. The gates are driven by a clock so that information can only travel when the computer is told to. Overclocking simply raises the frequency at which the signals pass through the gates. Doing this reduces the life of the components and typically calls for stronger (higher-voltage) signals, but because the signals travel down a specified route they cannot change no matter how much you overclock, as long as there is a time when the clock is on and a time when it is off. This means you should be safe to overclock the system as long as you do it properly and make sure it is stable. I would also recommend backing up the numbers onto removable drives just in case the computer fails.

hope this helps
 
As you may have gathered, probably not the right place to ask.

Just how fragile is the code? If you push the numerics too hard, rounding errors start to creep in, and if these can accumulate then accuracy is always going to suffer. But this will happen at stock or when overclocked.
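To show what that looks like in practice, here's a minimal Python sketch of rounding error accumulating in a plain sum versus compensated (Kahan) summation. The numbers are invented purely for illustration and have nothing to do with any particular simulation code or with overclocking itself:

```python
# Minimal sketch: floating-point rounding error accumulating in a naive sum,
# compared with Kahan (compensated) summation. Values are purely illustrative.

def naive_sum(values):
    total = 0.0
    for v in values:
        total += v          # each addition rounds to the nearest double
    return total

def kahan_sum(values):
    # Compensated summation: carry the rounding error forward explicitly.
    total = 0.0
    compensation = 0.0
    for v in values:
        y = v - compensation
        t = total + y
        compensation = (t - total) - y
        total = t
    return total

values = [0.1] * 10_000_000          # the true sum is 1,000,000
print(naive_sum(values))             # drifts noticeably away from 1,000,000
print(kahan_sum(values))             # much closer to 1,000,000
```

The point is that this sort of error is a property of the arithmetic and the order of operations, so a properly stable machine gives the same (slightly wrong) answer every time, overclocked or not.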

All I can offer is that a stable overclocked computer will get exactly the same answers as a stable stock-speed one. If the code gives variable answers at stock, you should get the same distribution of results when overclocked. Overclocks are normally tested by generating obscene numbers of prime numbers or solving a large number of simultaneous equations.

A stock-speed computer will give different answers to the same question when asked repeatedly if the code is too complex or poorly written; an unstable overclocked one will give variable answers to simple questions as well.

The potential issue lies with a nearly, but not quite, stable overclock. It would be wise to run a largish number of simulations, then overclock and run them again: not only time it, but also compare the outputs to verify it hasn't changed its mind (see the sketch below). This is on top of the standard stability testing. It'll never come close to the grid, but it can make working "offline" significantly nicer.
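If anyone does trial it, the comparison is easy to script. A minimal sketch, assuming the jobs write their results to plain files in two directories; the directory names and the *.dat layout here are made up, so adjust to however your jobs actually write output:

```python
# Minimal sketch: compare outputs of the same jobs run at stock and overclocked.
# "results_stock" and "results_oc" are hypothetical directory names.
import hashlib
from pathlib import Path

def file_digest(path):
    """SHA-256 of a file's contents, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

stock_dir = Path("results_stock")
oc_dir = Path("results_oc")

mismatches = []
for stock_file in sorted(stock_dir.glob("*.dat")):
    oc_file = oc_dir / stock_file.name
    if not oc_file.exists():
        mismatches.append((stock_file.name, "missing in overclocked run"))
    elif file_digest(stock_file) != file_digest(oc_file):
        mismatches.append((stock_file.name, "contents differ"))

if mismatches:
    for name, reason in mismatches:
        print(f"DIFFERS: {name} ({reason})")
else:
    print("All outputs identical between stock and overclocked runs.")
```

A byte-for-byte comparison like this only makes sense if the code is deterministic run to run at stock; if it gives variable answers anyway, you'd be comparing distributions of results instead, as above.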

Interested to hear of any successes/failures if you decide to trial this.
 
When overclockers say "stable" there are a lot of different interpretations; IMHO the general consensus in this overclocking community is that 24 hours of Prime95 counts as 'stable'.

I am unaware of what stability requirements your cluster manager needs; he may have a fault-tolerance contract with your hardware supplier that specifies particular conditions.

In terms that this forum's readers may understand: suppose he specified that it be Prime95 stable for 168 hours (one week). Only a few manufacturers deliver that level of error recovery/tolerance, and I doubt anything but server-grade hardware at stock settings would be used. For example, you don't see ECC memory in our home machines.

Everything is dependent on probability. You may find one enthusiastic overclocker exclaim that his/her system is 'stable', but compound that chance of failure across the cluster size, e.g. 64/128/256 computational units, and you will see the overall failure probability climb very quickly.
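To put a rough number on that (the per-node figure is invented purely for illustration), the chance that at least one node errors during a job is 1 - (1 - p)^N:

```python
# Minimal sketch: probability that at least one node errors during a job,
# given a per-node error probability p. The value of p is made up for
# illustration; real figures would have to come from your own testing.
def cluster_failure_probability(p_node, n_nodes):
    return 1.0 - (1.0 - p_node) ** n_nodes

p = 0.01  # suppose each overclocked node has a 1% chance of an error per job
for n in (1, 64, 128, 256):
    print(f"{n:4d} nodes: {cluster_failure_probability(p, n):.1%} chance of at least one error")
```

Even a per-node risk that sounds negligible becomes close to a coin flip, or worse, once you spread the job over a couple of hundred nodes.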

Now, if you use stock settings, a cluster manager can more or less use predictive data to quantify the failure probability and plan to suit. Overclocking adds a random element, to be minimised at all costs.

I'm not sure if this actually answers your original question about CPU design, but I'm inclined to believe a manager would sooner add 50% more nodes to a cluster than overclock it by 50%.

Unless you are talking about extreme cases: IBM and Cray make liquid-Freon-cooled cluster supercomputers, and these are 'overclocked'. They are in such a different league to even a consumer-available Vapochill sub-zero case that they're beyond the scope of these forums.
 
Nice one Pat, I'd forgotten about RAM making mistakes now and again. Processors do too; the old Itanium ones had error-correcting logic, and I think the Nehalem Xeons do as well, but I'm far from sure of that. Hard drives make mistakes too, and I imagine SSDs do as well.

I suppose I should define "stable" for my above post to make sense. A "stable" overclocked computer makes the same mistakes under the same circumstances (substitute "statistical distribution of errors" if feeling pedantic) that it does at stock settings. It's hard to say how much of a pipe dream this is; all I can be certain of is that a system can pass Prime95 and IBT for daft lengths of time and still crash during normal use. I put this down to inadequate testing rather than to an inherent fault in the overclocking process, but it's easy to believe what you want to when there's no evidence either way :)
 
One of the problems with trying to prove a system stable is that you don't know which bit of the logic may fail, which is why some machines pass IBT and then crash loading IE; different parts of the CPU get used depending on the data and instructions involved.
Given the complexity of today's CPUs, it's very difficult (impossible, I suspect) to run through every combination of states and check that the CPU is 100% stable. And you'd have to verify all the results, making the whole thing a very slow and laborious task.
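Back-of-envelope arithmetic makes the point. A quick sketch with invented numbers (bits of state, testing rate), just to show the scale of the problem:

```python
# Back-of-envelope sketch: exhaustive state testing is hopeless.
# The figures (bits of state, tests per second) are invented for illustration.
state_bits = 64                      # a tiny fraction of a real CPU's internal state
tests_per_second = 1e9               # a generous assumed testing rate
states = 2 ** state_bits
seconds = states / tests_per_second
years = seconds / (3600 * 24 * 365)
print(f"{states:.3e} states -> about {years:,.0f} years to test them all")
```

And that's before verifying every result, which is the slow part.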
 