Overclocked 5GHz Sandy Bridge datacentre build-out...

Oh yes, I was going to mention this too. It's very likely to be in a climate-controlled room, yes? In which case any decent air cooler will do a good job.

Yes, all machines will be deployed into controlled datacentre environments. Now the interesting bit is finding a rack-mountable box that I can crowbar a Noctua NH-D14 into without it being 10U... :eek: ;)
 
If I recall correctly, in Windows NT one could manually set an affinity between an application thread and a CPU core. Surely Linux must be capable of that?

Although your mention of a heavily modded *nix environment makes me doubt that :)
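For what it's worth, Linux can do this: taskset(1) pins a process to cores from the shell, and os.sched_setaffinity does the same from code. A minimal Python sketch (Linux-only; pinning to core 0 is just an example):

```python
import os

# Pin the current process (pid 0 = self) to CPU core 0 -- the Linux
# counterpart of setting processor affinity in Windows NT's Task Manager.
os.sched_setaffinity(0, {0})

# Confirm the new affinity mask.
print(os.sched_getaffinity(0))  # -> {0}
```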

Regarding the CPU: you'd be better off with the 2500K; the Hyper-Threading of the 2600K is wasted money, especially at 50-unit volume.

Cooling. Air coolers are limited by the ambient temps of the room. If the room is kept cool, the Noctua would be the perfect choice. If not, the H70 may be better.

As many have mentioned, 4.5GHz stable is more likely; only a small percentage of chips reach 5GHz. From the Asus testing labs:



Full official guide here. You may find it very useful.


That is pure gold. Muchas gracias! :)
 
Here's my guess: they're calculating something to do with real-time transactions, where first past the post makes the deal/buys something/sells something. Parallelism wouldn't help in that case. Seems to make perfect sense to me, hence the non-standard request. Makes a change from the usual spec requests ;)
 
I don't agree that something would return a different figure just because it was overclocked. If there were data corruption from an overclock, something more catastrophic than a wrong computed figure would happen. I'm not saying it can't happen, and I'm sure some will give an example, but if you run your car at 10,000rpm continually it's more likely the cam belt will snap or the piston rings will blow than that the wrong channel on the radio will play.

---

If I could take a guess, the boxes are to be used for some kind of arbitrage or hedging: bots scanning for price changes, previous data checked, an optimum buy/sell/back/lay price calculated, and the transaction requested on a third-party site ;)

This wouldn't need enterprise kit. You need the fastest practical o/c'd box you can get.

Why do you think things like Prime95 are used for stability testing? Look at what happens when SuperPi fails, etc. What happens when an overclock fails isn't set in stone: it might crash outright, hard-lock, or just very slightly get things wrong, not to the level where anything stops working, but so that the data calculated is wrong; it usually results in incorrect output for an operation. You also have to take into account that the FSB, chipset, memory, etc. are often overclocked too, so incorrect data can slip in anywhere.

Seems to be a lot of naivety in this thread... slightly scary given the context but I've said my piece.
 
A bit patronising, too.

I was just not aware that overclocking could result in a wrong figure being calculated. Maybe you can never be certain even after a 24hr pass. Maybe a wrong figure could be calculated at stock given a long, long run time.

y = 5 / 7
x = 0
while x < 1_000_000_000:  # a very big number
    if 5 / 7 != y:
        print('Alert! Wrong number calculated.')
    x += 1

I just think it is alarmist to say that if you perform overclocking your specific numbers are going to be way out, but the rest of the o/s will run fine.
 
I just think it is alarmist to say that if you perform overclocking your specific numbers are going to be way out, but the rest of the o/s will run fine.

It's not alarmist, it's true. I've had OC'd systems that could run day to day with no noticeable issues, but I ran Prime to test the stability and it failed after a few hours.
The reason Prime fails is that the system calculates one of the results incorrectly.
 
It depends how important the information he is dealing with is. He was suggesting it was important and linked to financial data, and there is a reason why that data is usually computed on high-end hardware, typically from IBM.

Yes, even at stock a CPU can potentially get operations wrong, which again is why I don't like the use of consumer CPUs in this context: they are less stringently binned, and overclocking increases the chances of things like this happening.

Take a look at stuff like

http://en.wikipedia.org/wiki/Pentium_FDIV_bug

In this case it was a design bug, but pushing a CPU too far could, for instance, result in corruption of the data in its lookup table.
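For the curious, the well-known check for that bug divides 4195835 by 3145727 and multiplies back: a correct FPU leaves a remainder of essentially zero, while the flawed Pentiums famously returned 256. A quick sketch in Python:

```python
# The canonical FDIV-bug operand pair.
x, y = 4195835.0, 3145727.0

# On a correct FPU this remainder is ~0 (bar tiny rounding noise);
# the flawed Pentium's division was wrong enough to leave 256 here.
remainder = x - (x / y) * y
print(abs(remainder) < 1.0)  # -> True on any correct FPU
```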


EDIT: Sorry if I come across as patronising; I don't find it easy to explain what I know in my head... and I've seen the aftermath of this kind of thing going wrong before.
 
Have you asked whether the data centre will let you put overclocked standard kit in? I know the guys in our DC will only let us have approved kit from the likes of Dell, HP, etc.

Also, working in financial services, I think the FD and COO here would shoot me down in a ball of flames if I suggested using overclocked kit to run any of our business applications.
 
The host we use does have provision for less, uh, standard setups for colo, but they won't allow water cooling (even stuff like the H50/70) in rack mounts, and they still charge on the 1U, 2U, etc. scale for standing space, so for a decent-sized ATX case it's none too cheap. (They also have restrictions on thermal parameters and amperage which overclocked kit would probably be pushing.)
 
We're fine with the amperage and cooling considerations. We'd be looking to deploy this solution in a lot of places around the world, so have had varying responses from some DC vendors; funnily enough, the Germans and Americans are anti the H50/70, whereas the Brits and Canadians have a more relaxed attitude. Ha! :D

With regard to stability etc., Frank_Rizzo is entirely correct. "Number crunching" seems to be the generic term you're using, Rroff, for the practical application of what we're doing; however, number crunching is best performed by specialised hardware in a context that supports large message loads along with heavy computational demands, something like GPUs for instance. Unfortunately I can't say anything else beyond that.
 
Another question... in 12 months' time the Asus Maximus IV Extreme Intel P67 motherboard
goes EOL and your servers in the DC start failing, either due to the OC or some other reason. What then is your contingency plan?

With supported server boards the manufacturer guarantees a replacement board the next day to repair your server. However, if you go down the consumer (gamer) board route with an OC'd SB you have no such fallback, unless of course you're going to stockpile the board.

Secondly, memory? If using a non-server board you would lose the ECC functionality of your memory. I don't know how much of an issue this would be for you, but it's just one more thing counting against the proposal.

Lastly, there are known issues with the SATA 6Gb/s connections on the boards. I guess, seeing as it's in a data centre, you would be using some kind of SAS RAID array and not building an array from the onboard SATA?

Rroff - we also have provision for non-standard setups and were using this part of the DC for our SharePoint migration. Our hosts call it "complex managed", which allows us to put our own kit in; we needed this to run our own old tower servers alongside the rack-mounted kit behind the same firewalls etc.

Personally I wouldn't touch the setup in the OP with a stick. Just off the top of my head... what happens if the person that builds the rig leaves? Who will support the OC'd rigs? There seem to be so many reasons why you shouldn't do this and only one why you should... performance.
 
Another question... in 12 months' time the Asus Maximus IV Extreme Intel P67 motherboard
goes EOL and your servers in the DC start failing, either due to the OC or some other reason. What then is your contingency plan?

With supported server boards the manufacturer guarantees a replacement board the next day to repair your server. However, if you go down the consumer (gamer) board route with an OC'd SB you have no such fallback, unless of course you're going to stockpile the board.

Secondly, memory? If using a non-server board you would lose the ECC functionality of your memory. I don't know how much of an issue this would be for you, but it's just one more thing counting against the proposal.

Lastly, there are known issues with the SATA 6Gb/s connections on the boards. I guess, seeing as it's in a data centre, you would be using some kind of SAS RAID array and not building an array from the onboard SATA?

Rroff - we also have provision for non-standard setups and were using this part of the DC for our SharePoint migration. Our hosts call it "complex managed", which allows us to put our own kit in; we needed this to run our own old tower servers alongside the rack-mounted kit behind the same firewalls etc.

Personally I wouldn't touch the setup in the OP with a stick. Just off the top of my head... what happens if the person that builds the rig leaves? Who will support the OC'd rigs? There seem to be so many reasons why you shouldn't do this and only one why you should... performance.

1. If the boards go EOL, so be it. The point is this: if they perform as expected, we'll have made so much money that their cost becomes absolutely immaterial. An o/c'd box like the one I've proposed is vastly cheaper than a dual-processor Xeon server anyway, and is infinitely faster for what we want it to do.

2. Memory? Again, non-ECC is faster than ECC memory, which is more appealing for us. End of story.

3. Point 3 - see point 1. ;)
 
This is indeed an interesting thread, but sorry to sound like a naysayer: it does seem a tad naive.

Have you done a feasibility study outlining worst-case scenarios with the associated overclocking issues?
This should cover uptime, accuracy of data, and the associated manpower.
As mentioned in this thread, enterprise hardware is designed to be not only resilient and redundant but also (important in your context) accurate.

ECC is designed to ensure errors don't creep into results, something that non-enterprise CPUs and memory don't offer (especially when overclocked); it's there for a very good reason!

You seem to be hell-bent on speed. But what's more important, speed or accuracy? In between crashes, you'll have no idea whether your uber-OC'd rig is providing reliable numbers. Does your boss realise this?

Crunch some numbers - include everything. I'm amazed that a business so focussed on crunching lots of numbers doesn't seem to care about up-time or accuracy.
 
^^ I think we're taking it too seriously, and what he's doing isn't actually that critical or finance-based; probably mapping trends or something where accuracy isn't that important in a general sense. Otherwise it seems like sheer madness to be running without ECC, resilience, etc.
 
But what's more important, speed or accuracy? In between crashes, you'll have no idea whether your uber-OC'd rig is providing reliable numbers

I can't have this. If an O/C rig is going to fail, it's not going to just give spurious numbers.

Really unreliable O/C = no POST
Unreliable O/C = crash in the o/s
Flaky O/C = crash some app / driver
Stable O/C = stable running rig

Stable O/C may fail some time in the future due to component stress (capacitor breakdown...)

That is the usual scenario. All PCs, clocked or not, will ultimately fail due to component failure. All crazily overclocked PCs will be unreliable: they won't boot, or they'll crash the o/s, a driver, or an app.

But to say that overclocking makes 1 + 1 sometimes equal 3, or more precisely 22/7 sometimes equal 3.14285715...

Just seems a bit far-fetched. :)
 
I can't have this. If an O/C rig is going to fail, it's not going to just give spurious numbers.

...

But to say that overclocking makes 1 + 1 sometimes equal 3, or more precisely 22/7 sometimes equal 3.14285715...

Just seems a bit far-fetched. :)

Actually, not really. Without ECC you're never sure when your memory gets corrupted. Running memory overclocked means a higher likelihood of something going wrong, and without ECC you won't know it has.

Therefore it's entirely possible that memory gets corrupted and your 2 becomes 3.

Overclocking (especially increasing the voltage) makes this more likely, but it doesn't mean it won't happen without overclocking. That's what memtest86 is for!
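As a purely software illustration of that point (this just simulates a one-bit memory error; it says nothing about how often real hardware flips bits): a single flipped bit turns 2 into 3, and one flipped mantissa bit moves 22/7 in about the seventh decimal place:

```python
import struct

# Flip the lowest bit of the integer 2 (0b10) and you get 3 (0b11).
print(2 ^ 1)  # -> 3

# Flip one mantissa bit of the double 22/7 and watch the value drift.
correct = 22 / 7
bits, = struct.unpack('<Q', struct.pack('<d', correct))
corrupted, = struct.unpack('<d', struct.pack('<Q', bits ^ (1 << 30)))
print(abs(corrupted - correct))  # -> exactly 2**-21, about 4.8e-7
```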

Anyway, the poster seems very keen on doing this, and hell, it's not my server room or my job, so I say go for it - sounds like a fun project! I wish I was paid to do that at work!
 

Actually, not really. Without ECC you're never sure when your memory gets corrupted. Running memory overclocked means a higher likelihood of something going wrong, and without ECC you won't know it has.

Therefore it's entirely possible that memory gets corrupted and your 2 becomes 3.

Overclocking (especially increasing the voltage) makes this more likely, but it doesn't mean it won't happen without overclocking. That's what memtest86 is for!

Anyway, the poster seems very keen on doing this, and hell, it's not my server room or my job, so I say go for it - sounds like a fun project! I wish I was paid to do that at work!

In my experience, when I've had issues with OCs I've generally had problems with the OS failing long before anything else goes pear-shaped. However, your point is fairly made, and we wouldn't be considering anything of this sort unless we had a *lot* of risk management in place to ensure that, after our application does its thing, its output is carefully vetted against other systems we have in place. I understand your concern, however... :)
 