AMD Bulldozer Finally!

mmj_uk · 7 Nov 2011 at 14:08

IMO they would be better served doubling the capability of each module and reducing the number of them by half, I'm still a big proponent of per core performance and Bulldozer is a laughing stock in that department.

Cinebench single thread scores for example:
Bulldozer @4.7ghz = 1.14 (taken from here)
Phenom II @4.2ghz = 1.24
2700K @4.8ghz = 1.94 (my system)
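
The three chips above ran at different clocks, so a rough way to compare them per clock is to divide each score by its frequency. A minimal Python sketch, assuming the scores scale linearly with clock speed (a simplification; the names `results` and `score_per_ghz` are just for illustration):

```python
# Cinebench single-thread scores and clock speeds (GHz) quoted in the post.
results = {
    "Bulldozer": (1.14, 4.7),
    "Phenom II": (1.24, 4.2),
    "2700K": (1.94, 4.8),
}


def score_per_ghz(score, clock_ghz):
    """Score normalised by clock: a crude stand-in for per-clock performance."""
    return score / clock_ghz


per_ghz = {name: score_per_ghz(s, c) for name, (s, c) in results.items()}

for name, value in per_ghz.items():
    print(f"{name}: {value:.3f} points per GHz")
```

That works out to roughly 0.24, 0.30 and 0.40 points per GHz respectively.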

I just think AMD need to forget this quantity (of cores) over quality strategy, I'd much rather have a 4 or 6 core CPU which is well rounded.

Nelly · 7 Nov 2011 at 14:41

I wonder if it's possible to fry an egg using a Bulldozer CPU and low performance cooler while overclocked with no fan?

David_Vaughton · 7 Nov 2011 at 20:35

8 Cores

Are there many applications that will actually make use of 8 Cores?

Excluding Adobe that is.............

Gashman · 8 Nov 2011 at 02:00

it isn't designed to be a 'single-thread' monster, that was never ever the intention, not once in the development or the preliminaries was that said to be the case, it is an architecture designed for maximum throughput with dynamic resource sharing, power gating and high frequencies.

also the cores will always need a clock speed advantage to perform at their best, that again was part of the design groundwork. the architecture was designed to be a hybrid between a speed demon style design for pure clock speed and a decent instruction per clock design, it is neither of those things and sits somewhere in the middle.

but in the grand scheme of things the whole concept of 'instructions per clock' is hyperbole and has no real world relevance, because people might say 'the better instruction per clock the higher the performance of the full processor...?' but it isn't so clear cut as that. performance is always, and always will be, based on three factors: firstly the number of instructions a core can execute in each cycle (which varies depending on coding, it isn't constant and never ever has been! so it's irrelevant to say 'well the 2500K does X instructions/cycle' because it is entirely dependent on the sort of code being executed!), secondly the core frequency or basically the number of cycles in a second (again, this has just as much relevance as instructions/cycle) and thirdly the number of threads capable of being executed in each cycle (core count effectively).

instructions/cycle is meaningless without a frequency, and frequency is meaningless without instructions/cycle, so a processor can do tons of instructions but without a meaningful frequency it will be pointless. the thinking behind Bulldozer was therefore, keep the instructions/cycle equal to 'Stars' (which granted isn't working at the moment, but that could be caused by an absolute ton of factors, even as simple as fabrication issues) and make a design that achieves much higher frequencies, because frequencies (cycles) is a constant and instructions/cycle isn't so it makes sense to make the pursuit of frequency an appealing goal.
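
The three factors can be put into a toy model: throughput is instructions per cycle times cycles per second times the number of cores. This is only a sketch under that simple assumption; the two chips and every number in them are made up for illustration and are not measurements of any real processor.

```python
from dataclasses import dataclass


@dataclass
class Chip:
    name: str
    ipc: float        # instructions per cycle for some particular code (hypothetical)
    clock_ghz: float  # billions of cycles per second (hypothetical)
    cores: int

    def single_thread(self):
        """Billions of instructions per second on one core."""
        return self.ipc * self.clock_ghz

    def all_cores(self):
        """Billions of instructions per second with every core busy."""
        return self.single_thread() * self.cores


# Two invented designs: many fast, narrow cores against fewer wide ones.
narrow_fast = Chip("narrow_fast", ipc=1.0, clock_ghz=4.5, cores=8)
wide_slow = Chip("wide_slow", ipc=1.5, clock_ghz=3.5, cores=4)

for chip in (narrow_fast, wide_slow):
    print(chip.name, chip.single_thread(), chip.all_cores())
```

With these invented figures the wide design wins on one thread (5.25 against 4.5) while the narrow design wins with all cores loaded (36 against 21), which is the trade-off being argued about.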

take into account that each Bulldozer integer core has 'more' theoretical performance inside of it than those on Phenom II, now assuming that there is a front-end or cache issue that is causing the instructions/cycle to lower regardless of the fact it has 'more' potential power you could indeed conclude that JF-AMD's statement that IPC has been improved over Phenom II is true, because in this situation Bulldozer II (B3) would have an advantage in clock speed over the older models whilst maintaining the same sort of instruction throughput for each core, more in a non-sharing workload.

so I have to disagree hugely on the statement that AMD need to go back to try to steal the pure instruction/cycle crown from Intel, because as it stands Bulldozer is still on paper a wonderfully powerful architecture which is having some obvious teething problems and is bleeding performance from some areas as well as bleeding wattage. the fact is though that Bulldozer is intended to be a balanced design that doesn't relentlessly chase instructions/clock and doesn't relentlessly chase clock speed. AMD made a decision that 1) the extra resources in the 'Stars' core most of the time sit doing nothing (who are we to argue with that logic, it is their architecture, they know more about it than we do!) and 2) most code can't satisfy the demands of the 'instructions/clock' monster so having the extra resources is theoretically pointless, so instead they went down the streamlined route and put another complete integer core in each module, giving it the potential to execute another thread in parallel without adding a significant amount of die-space.

would like to make a bit of a prediction regarding B3 personally speaking, it will have better power consumption figures and lower temperatures, also suspect we will see an instruction/clock improvement through cache/front-end tweaks, Intel will always win in these 'meaty' threads that make best use of their instruction/clock based architecture where their Sandy Bridge chip is unrivalled but Bulldozer will win in situations where a program has been coded so that instead of one or two big, meaty threads you get six or so much more streamlined threads, that is the sort of situation Bulldozer will excel in the future. also since the programming world is moving away from the arcane 'meaty' thread style approach and more into the world of optimising for more cores, I can't see how anyone thinks that AMD's decision is remotely a negative one for the future?

also finally (sorry for long post again :rolleyes: my bad!) you won't be able to fry eggs on a Bulldozer chip because it will throttle back very quickly when overclocked on a terrible cooler, temperatures will rocket and the processor will go into a lower power state to keep thermals under control, like more or less all modern processors do!

sunama · 8 Nov 2011 at 02:40

Gashman it really does sound like you are attempting to polish a turd.

What I will say is that the first iteration of this product is poor and at current pricing does not offer good value for money.

However, if they can reduce the price (which will happen once retail channels are properly supplied), then the FX81XX CPUs could offer a very good basis to build a mid-range computer.

With regards to IPC: my gut feeling is that to improve IPC, it will take a lot of work. It might be easier to simply ramp up clock speeds, which is basically what Intel did with their P4 CPU.

BD is to Phenom 2, what P4 was to P3 (when it first released).

I'm pretty sure that AMD will choose to use the method which is easiest. Personally, from a marketing perspective, it would be easier to market a 5Ghz, 8core cpu (with low IPC), than a 4Ghz, 8 core cpu (with high IPC).

In any case, if you want the best CPU OR the CPU which offers the best value for money, go with Intel. Right now BD just isn't worth the money.

Tuvoc · 8 Nov 2011 at 07:37

sunama said:
In any case, if you want the best CPU OR the CPU which offers the best value for money, go with Intel. Right now BD just isn't worth the money.

Depends what you are doing with your PC. I had been waiting for 8 core bulldozer. After Bulldozer came out, I saw the performance numbers and realised this first release was not so great. So I promptly went out and bought a Phenom II 1100T. Not Intel. I am now using my 6-core machine in a way that I simply could not do with any quad core CPU, including Sandybridge. I was constantly running out of cores with my existing Intel quad, could not do everything I wanted to do. Now with 6 affordable 1100T cores I can

sunama · 8 Nov 2011 at 09:23

Tuvoc said:
I was constantly running out of cores with my existing Intel quad, could not do everything I wanted to do. Now with 6 affordable 1100T cores I can

So are you saying that a 1100T is faster than a 2600K (with its 8 HT cores)?

Plus1 · 8 Nov 2011 at 10:10

sunama said:
So are you saying that a 1100T is faster than a 2600K (with its 8 HT cores)?

The 2600K is not priced the same as the 1100T/2500K. And I see sense in going for a 1100T, if you are doing things such as virtual machine/device emulation testing, where you will assign each core to a different device.

Tuvoc · 8 Nov 2011 at 13:31

sunama said:
So are you saying that a 1100T is faster than a 2600K (with its 8 HT cores)?

No, I am saying that I can do more with 6 *real* cores than 4

And of course the 1100T is much better value. The Intel competitor for me was the Intel Core i7 980 6-core, massively expensive but of course much faster

So - I replaced my Intel quad box with an AMD 6 core.

Gashman · 11 Nov 2011 at 12:43

speaking of Hyper Threading, how is it that it is offering so much more performance these days than it used to, what has changed inside Sandy Bridge that makes Hyper Threading so much more effective, used to remember it gave rather small boosts in performance at times and others it used to hamper performance. :confused:

mmj_uk · 11 Nov 2011 at 13:24

Gashman said:
speaking of Hyper Threading, how is it that it is offering so much more performance these days than it used to, what has changed inside Sandy Bridge that makes Hyper Threading so much more effective, used to remember it gave rather small boosts in performance at times and others it used to hamper performance.

The poor performance before was probably more down to the Netburst architecture as a whole, P4s had extremely deep pipelines.

Hyper Pipelined Technology
Intel chose this name for the 20-stage pipeline within the Willamette core. This is a significant increase in the number of stages when compared to the Pentium III, which had only 10 stages in its pipeline. The Prescott core has a 31-stage pipeline. Although a deeper pipeline has some disadvantages (primarily due to increased branch misprediction penalty) the greater number of stages in the pipeline allow the CPU to have higher clock speeds which was thought to offset any loss in performance. A smaller instructions per clock (IPC) is an indirect consequence of pipeline depth—a matter of design compromise (a small number of long pipelines has a smaller IPC than a greater number of short pipelines). Another drawback of having more stages in a pipeline is an increase in the number of stages that need to be traced back in the event that the branch predictor makes a mistake, increasing the penalty paid for a mis-prediction. To address this issue, Intel devised the Rapid Execution Engine and has invested a great deal into its branch prediction technology, which Intel claims reduces mis-predictions by 33% over Pentium III.

Intel has replaced NetBurst with the Core microarchitecture, released in July 2006, which is more directly derived from 1995's Pentium Pro than it is from NetBurst. August 8, 2008 marked the end of Intel NetBurst based processors. The reason for NetBurst's abandonment was the severe heat problems caused by high clock speeds. While Core- and Nehalem-based processors have higher TDPs, most processors are multi-core, so each core gives off a fraction of the maximum TDP, and the highest-clocked Core-based single-core processors give off a maximum of 27 W of heat. The fastest-clocked desktop Pentium 4 processors (single-core) had TDPs of 115 W, compared to 88 W for the fastest clocked mobile versions. Although, with the introduction of new steppings, TDPs for some models were eventually lowered.

If only AMD had learned from Intel's mistakes we wouldn't have Bulldozer.

sunama · 11 Nov 2011 at 16:20

Tuvoc said:
No, I am saying that I can do more with 6 *real* cores than 4

And of course the 1100T is much better value. The Intel competitor for me was the Intel Core i7 980 6-core, massively expensive but of course much faster

So - I replaced my Intel quad box with an AMD 6 core.

I have a question.

I am creating a program (server environment) which is highly multi-threaded. It is quite possible that that program will run faster on a 1100T than a 2500K (both similar prices).

My previous understanding was that:
the 2500K would be faster than the 1100T, even when dealing with multiple threads.

If I use the 1100T (with 6 cores), the work that each core can do is significantly less than what the 2500K can do and as a result, even on heavily multi-threaded apps, the 1100T will still struggle to beat the 4 core 2500K.

Also consider that I shall be overclocking (I have a watercooling system), so both CPUs would be clocked as high as they can go. If this is the case, then surely the 1100T would not be the way to go...correct?

Let me know if you have an alternative argument.

eddiew · 11 Nov 2011 at 16:47

sunama said:
Let me know if you have an alternative argument.

As a server application, do you genuinely expect occasions where you have multiple (more than 4) threads using 100% of CPU time each, for any sustained period?

I had to write something for work that does about 3 seconds worth of CPU heavy numbercrunching per call to the server. Single core is out of the question for a production server, dual core is barely adequate; 4 cores, even at peak service times, always has an idle core.

Given that an i5 has better IPC, AND will clock significantly higher than an 1100T, if I were building a rig for the job I'd choose the i5, as it would probably shave 30% off my single thread completion time. I've also found that Intel based servers are genuinely better than AMDs for said crunching, although I'm not entirely convinced this isn't something to do with Microsoft and Intel being up a tree...

(Really you need some stats on maximum/average concurrent requests and average runtime of each thread. You may find you're making a choice at peak times of either serving 2 more requests simultaneously, OR finishing each individual request in a shorter time. Your mileage WILL vary.)

sunama · 11 Nov 2011 at 16:59

eddiew said:
I had to write something for work that does about 3 seconds worth of CPU heavy numbercrunching per call to the server. Single core is out of the question for a production server, dual core is barely adequate; 4 cores, even at peak service times, always has an idle core.

Given that an i5 has better IPC, AND will clock significantly higher than an 1100T, if I were building a rig for the job I'd choose the i5, as it would probably shave 30% off my single thread completion time.

This is worded nicely and is exactly the reason why I would choose a 2500K over a similarly priced 1100T, even in heavily multi-threaded environments. HOWEVER, Tuvoc feels otherwise, and that is the reason why I am questioning this reasoning. Perhaps my argument, and the one above, is wrong?

CAT-THE-FIFTH · 11 Nov 2011 at 17:05

sunama said:
I have a question.

I am creating a program (server environment) which is highly multi-threaded. It is quite possible that that program will run faster on a 1100T than a 2500K (both similar prices).

My previous understanding was that:
the 2500K would be faster than the 1100T, even when dealing with multiple threads.

If I use the 1100T (with 6 cores), the work that each core can do is significantly less than what the 2500K can do and as a result, even on heavily multi-threaded apps, the 1100T will still struggle to beat the 4 core 2500K.

Also consider that I shall be overclocking (I have a watercooling system), so both CPUs would be clocked as high as they can go. If this is the case, then surely the 1100T would not be the way to go...correct?

Let me know if you have an alternative argument.

Server CPUs are not overclocked. On top of this, trying to look at desktop workloads and predict how well a CPU will do in a server environment does not make sense. It depends on what sort of applications you are running, what OS you are running and the infrastructure you are using. On top of this, support for certain extensions can also be very important.

A lot of the fastest supercomputers in the world are not necessarily using the CPUs with the fastest cores even if they are available.

Even in the desktop environment there are instances where even a Phenom II X6 clock for clock is not slower than a Core i5 2500. The same can be seen with previous generation Core i7 CPUs too when compared to a Core i5 2500. They technically have slower cores but with HT can end up faster.

eddiew · 11 Nov 2011 at 18:24

CAT-THE-FIFTH said:
Server CPUs are not overclocked. On top of this trying to look at desktop workloads and trying to predict how well it will do in a server environment does not make sense. It depends on what sort of applications you are running,what OS you are running and the infrastructure you are using.

This is very true. You'd have to do some stats analysis of your common loadings.

Simple Option
However, let's consider throughput as a simple function of work / cores:

i = time taken to run the job on 1x i5 core
p = time taken to run the job on 1x 1100T core

And assuming perfect scaling to n threads...

IF ( i / 4 < p / 6 ) buy i5
ELSE buy 1100T

And there is genuinely nothing I can suggest other than testing your specific app on each processor and seeing whether the above is true.
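
The simple rule above is easy to put in code. A minimal Python sketch, assuming perfect scaling across all cores as stated; the function name `better_buy` and the timings in the example calls are made up, so substitute your own measurements.

```python
def better_buy(i5_core_time, x6_core_time, i5_cores=4, x6_cores=6):
    """Pick the CPU that finishes the whole job sooner under perfect scaling.

    i5_core_time: time to run the job on one i5 core (i in the post)
    x6_core_time: time to run the job on one 1100T core (p in the post)
    """
    i5_time = i5_core_time / i5_cores
    x6_time = x6_core_time / x6_cores
    return "i5" if i5_time < x6_time else "1100T"


# Hypothetical timings in seconds, for illustration only.
print(better_buy(20, 25))  # 20/4 = 5.0 against 25/6 = 4.17
print(better_buy(16, 25))  # 16/4 = 4.0 against 25/6 = 4.17
```

Under this rule the i5 only wins once its single-core time drops below two thirds of the 1100T's.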

Really Hard Stuff
This is of course the more realistic option, but bugger me it's a headpeck!

u is the number of simultaneous users that your server can support
h is the number of actual threads used per user request
c is the number of execution cores available
tE is the total amount of execution time required (if run on a single core)
tI is the idle time (per user) between requests

Assumption: perfect thread scaling, i.e. with 4 threads the job takes 1/4 as much time as with 1 thread. This is a lie, but it's a starting point.

One single user on average will chew up a portion of available computing resource equal to:

1 / (((tE + tI) / tE) / (h / c))

E.g. if a task takes 20 seconds on 1 core, with 60 seconds between requests, and we allow 2 threads per user on a quad core...

1 / (((20 + 60) / 20) / (2 / 4))
= 1 / (4 / 0.5)
= 1/8 th of all computing resources are going to one user

(This is the common sense maths: the user is busy for 20 seconds in every 80, i.e. one quarter of the time, on half the CPU cores, and one quarter of one half is one eighth.)

To find your maximum number of concurrent users, take away the 1/ at the start:

Max users = (((tE + tI) / tE) / (h / c))
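
The two formulas above translate directly into Python. This is a sketch of the same model with the same perfect-scaling assumption; the function and parameter names are mine, chosen to match the variables listed earlier.

```python
def max_users(t_exec, t_idle, threads_per_user, cores):
    """Max users = ((tE + tI) / tE) / (h / c)."""
    cycle_ratio = (t_exec + t_idle) / t_exec  # (tE + tI) / tE
    core_fraction = threads_per_user / cores  # h / c
    return cycle_ratio / core_fraction


def share_per_user(t_exec, t_idle, threads_per_user, cores):
    """Portion of total computing resource one user consumes on average."""
    return 1 / max_users(t_exec, t_idle, threads_per_user, cores)


# The worked example: 20 s task, 60 s idle, 2 threads per user, quad core.
print(max_users(20, 60, 2, 4))       # 8.0
print(share_per_user(20, 60, 2, 4))  # 0.125
```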

Guess #1
I'm going to go with blind guesswork now...

tI (user idle time) is a flat 60 seconds between queries.
tE (total execution time) can be 20 seconds for an i5, 25 seconds for an 1100T
h (allocated threads per user) is 2
c is 4 for an i5, or 6 for an 1100T

So for an i5, u = 8, as described above
For an 1100T: u = ((25 + 60) / 25) / (2 / 6) = 10.2 users
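
The same guess can be checked in a few lines of Python. A sketch only: the numbers are the blind guesses above, not measurements, and the dictionary layout is just one convenient way to hold them.

```python
def max_users(t_exec, t_idle, threads_per_user, cores):
    """Max users = ((tE + tI) / tE) / (h / c)."""
    return ((t_exec + t_idle) / t_exec) / (threads_per_user / cores)


T_IDLE = 60   # tI: seconds between queries
THREADS = 2   # h: threads allocated per user

# Guessed single-core execution time (tE) and core count (c) per CPU.
cpus = {
    "i5": {"t_exec": 20, "cores": 4},
    "1100T": {"t_exec": 25, "cores": 6},
}

users = {
    name: max_users(spec["t_exec"], T_IDLE, THREADS, spec["cores"])
    for name, spec in cpus.items()
}

for name, u in users.items():
    print(f"{name}: {u:.1f} users")
```

So with these guesses the 1100T supports about two more concurrent users, even though each request takes longer on it.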

More Fancy
Now for the clever bit: let's find how much faster an i5 has to be to deliver better throughput. An i prefix is intel, an a prefix is AMD! An i5 is better when:

(((itE + tI) / itE) / (h / 4)) > (((atE + tI) / atE) / (h / 6))

Multiply across...

(h/6) * ((itE + tI) / itE) > ((atE + tI) / atE) * (h/4)

Both sides are now being multiplied by h, we can in fact remove it...

((1/6)(itE + tI)) / itE > ((1/4)(atE + tI) / atE)

Simplifying, the i5 is better when...

(itE + tI) / itE > 1.5 * (atE + tI) / atE

(Intel single thread run time + user idle time) / (Intel single thread run time) > 1.5 * (AMD single thread run time + user idle time) / (AMD single thread run time)

Unfortunately at this point there's nothing to be done without some numbers. Eventually you end up at an inequality where the i5's processing time has to be less than 5 times itself minus 60, which is a bit rubbish

So what you need is your per-request-runtime running single threaded on each type of CPU, and the average time each user will allow between requests.
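
Once you do have numbers, the inequality can be rearranged to give the longest i5 run time that still beats the 1100T on user count. Starting from (itE + tI) / itE > 1.5 * (atE + tI) / atE, expanding both sides gives tI / itE > 0.5 + 1.5 * tI / atE, so itE < tI / (0.5 + 1.5 * tI / atE). A Python sketch of that rearrangement, under the same perfect-scaling model; the function name is hypothetical and the example uses the guessed 25 and 60 seconds from Guess #1.

```python
def i5_break_even(amd_exec, t_idle, intel_cores=4, amd_cores=6):
    """Longest i5 single-thread run time at which it matches the 1100T.

    Returns None when idle time is zero, because the inequality then
    reduces to 1 > 1.5 and can never hold in this model.
    """
    if t_idle <= 0:
        return None
    ratio = amd_cores / intel_cores  # 1.5 for 6 cores against 4
    return t_idle / ((ratio - 1) + ratio * t_idle / amd_exec)


# Guess #1 numbers: 1100T needs 25 s, users idle for 60 s.
limit = i5_break_even(25, 60)
print(f"i5 must finish in under {limit:.1f} s")
```

With those guesses the i5 would need to finish in under about 14.6 seconds, against 25 for the 1100T, before this model favours it.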

Or my maths could be wrong

*edit* Bugger me I must be bored...

sunama · 11 Nov 2011 at 18:46

That's an impressive array of calculations.
I'm actually programming right now, so I didn't go through it with a pen & paper. In any case, your calculations would be making a lot of assumptions. Furthermore, IF we knew the exact values for the variables, even after all the theoretical calculations, there is no absolute guarantee that those theoretical processing times would match up perfectly to the "actual" processing times.

Judging by the above answers, the only way of finding out which is better is to try it out and see (and run heaps of tests/benchmarks).

I can't help but feel that an i5 would be faster, even with fewer cores, especially once over-clocking is factored in. Remember, I am looking at value for money, too...I can't afford Xeon CPUs, hence over-clocking will be factored into the comparison.

ng93 · 11 Nov 2011 at 18:52

sunama said:
That's an impressive array of calculations.

My thoughts exactly.

Sunama, being as we're in the Bulldozer thread, have you considered an 8120? Depending on your operating system and workload they could potentially perform quite nicely, especially once overclocked. And it'd make a nice heater seeing as winter is just around the corner

eddiew · 11 Nov 2011 at 19:03

sunama said:
That's an impressive array of calculations.
I'm actually programming right now, so I didn't go through it with a pen & paper. In any case, your calculations would be making a lot of assumptions.

Indeed. I tend to go off on one after I've been out for a run... Also I've just realised that if idle time is zero, then it cancels out to "1 > 1.5", which is a bit of an issue xD

I think I was ok up to Max users = (((tE + tI) / tE) / (h / c)) but it may have drifted after that... ^^;

sunama said:
Judging by the above answers, the only way of finding out which is better is to try it out and see (and run heaps of tests/benchmarks).

I don't think it would be unfair to say that if you thread limit your program to, say, 2, then you're looking for the i5 to complete it in 2/3rds the time that the 1100T does, reflecting its one-third fewer cores (4 against 6).

Given that 100% scaling when adding threads never quite works, I suspect if the i5 was in the region of 70-75% as much execution time, it would be the better contender in general, and have a clear advantage in light load scenarios.

You could also consider an i7, which I think would be unarguably faster

sunama · 11 Nov 2011 at 19:07

The problem with BD(1), is that once overclocked, you are looking at higher power usage.
When you overclock the 2500k, the power usage is still acceptable. Consider that both the 8120 and 2500k are similarly priced.

Consider that the computer shall be on 24/7 and the plan would be to keep it for at least 3 years.

If AMD can sort out the heat output and high power usage (when clocked high), BD will come into the equation. Until then, the only AMD CPU to consider would be the 1100T.
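
To put a rough figure on the 24/7, three-year plan above, the extra running cost of a hungrier chip can be estimated from its extra power draw. A minimal Python sketch; the 100 W difference and the 0.15 per kWh price in the example are placeholders I made up, not measurements of any of these CPUs, so plug in your own wall readings and tariff.

```python
HOURS_PER_YEAR = 24 * 365  # machine is on around the clock


def extra_energy_cost(extra_watts, years, price_per_kwh):
    """Cost of drawing extra_watts more, continuously, for the given years."""
    extra_kwh = extra_watts / 1000 * HOURS_PER_YEAR * years
    return extra_kwh * price_per_kwh


# Hypothetical inputs: 100 W extra draw, 3 years, 0.15 currency units per kWh.
cost = extra_energy_cost(extra_watts=100, years=3, price_per_kwh=0.15)
print(f"Extra cost over 3 years: {cost:.2f}")
```

With those placeholder inputs the difference comes to about 394 over three years, which is why power draw matters for an always-on box.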