Hyperthreading: is it really useful?

CmdrTobs · 21 Jan 2011 at 00:34

ATIorNvidia said:
At "reasonable" loads any performance losses you could ever get would be extremely minimal (inside 1-3% most likely), at higher loads HT will show obvious gains.

When you say higher, you are really talking about test scenarios.

ATIorNvidia said:
Not everybody builds a PC to play Call of Duty. There's other things besides gaming, y'know? Show me any programs that use more than 4 cores and don't benefit from HT, until then you don't really have a case.

No, not everyone builds a PC for that, but most home applications have a user-interaction focus and have more in common with Call of Duty than some $10K render suite, which you should be running on a dual-socket Xeon system anyway...

ATIorNvidia said:
Show me any programs that use more than 4 cores and don't benefit from HT, until then you don't really have a case.

Show me any programs that can *balance* load 4 cores? Until you do that you actually don't have a case for HT over the namesake technique of this board.

ATIorNvidia said:
Rubbish. Prove it.

google my evidence for me I'm

ATIorNvidia · 21 Jan 2011 at 00:44

CmdrTobs said:
Show me any programs that can *balance* load 4 cores? Until you do that you actually don't have a case for HT over the namesake technique of this board.

How petty and childish. Until you can answer my question properly, don't bother replying.

CmdrTobs · 21 Jan 2011 at 01:18

ATIorNvidia said:
Yes, really. 8 cores and 8 threads. I can even try and dig out a post made by John Fruehe (AMD employee) over at the OCN forums where he stated that. Each module has two threads and two cores.

Learn to decipher marketing speak. The Bulldozer module is actually just a CPU: what became known a few years ago as a 'core', when Intel started sticking two P4 dies in one package. I would persist with the term die, but now more than one 'CPU' (or now core) is deposited per crystal wafer, so die becomes inappropriate, and CPU now solely describes what plugs into a socket. The word thread, outside of the OS software concept, is to be strongly avoided unless the aim is to confuse and sell chips to the likes of you.

What AMD insists on calling a "Bulldozer module" is really a Bulldozer core, and each of these cores is capable of executing two threads simultaneously. This is better than Intel's current method, as essentially the integer stuff (think common processor duty stuff) is duplicated. So it seems this is more of an intermediate step between a real core and an imaginary one from microcode (Intel's HT).

Which goes back to what I said earlier, along the lines of: "AMD bringing something out in Bulldozer that will be more like what people were expecting of Hyperthreading"

Anandtech elucidates beautifully without marketing BS:

Anandtech said:
A single Bulldozer core will appear to the OS as two cores, just like a Hyper Threaded Core i7. The difference is that AMD is duplicating more hardware in enabling per-core multithreading. The integer resources are all doubled, including the schedulers and d-caches. It’s only the FP resources that are shared between the threads. The benefit is you get much better multithreaded integer performance, the downside is a larger core.

ATIorNvidia said:
Yawn.

Given your demonstrated lack of understanding, it's hard to take that as a genuine expression of boredom; rather, it all shot over your head and you're as bored as a toddler in a Lie algebra lecture.

I can't wait till you angrily reply and tell me you're on £80K at ARM! The world we live in these days... "It's not who ya know, it's who ya blow." - Superhands

CmdrTobs · 21 Jan 2011 at 01:23

ATIorNvidia said:
How petty and childish. Until you can answer my question properly, don't bother replying.

No, a very serious and very relevant answer. You just don't like it as it sent your point off a ravine. If anyone is being puerile, it's you and your inability to handle it.

ATIorNvidia · 21 Jan 2011 at 01:26

CmdrTobs said:
Learn to decipher marketing speak. The Bulldozer module is actually just a CPU core: what became known a few years ago as a 'core', when Intel started sticking two P4 dies in one package. I would persist with the term die, but now more than one 'CPU' (or now core) is deposited per crystal wafer, so die becomes inappropriate, and CPU now solely describes what plugs into a socket. The word thread, outside of the OS software concept, is to be strongly avoided unless the aim is to confuse and sell chips to the likes of you.

What AMD insists on calling a "Bulldozer module" is really a Bulldozer core, and each of these cores is capable of executing two threads simultaneously. This is better than Intel's current method, as essentially the integer stuff (think common processor duty stuff) is duplicated. So it seems this is more of an intermediate step between a real core and an imaginary one from microcode (Intel's HT).

Which goes back to what I said earlier, along the lines of: "AMD bringing something out in Bulldozer that will be more like what people were expecting of Hyperthreading"

Anandtech elucidates beautifully without marketing BS:

Given your demonstrated lack of understanding, it's hard to take that as a genuine expression of boredom; rather, it all shot over your head and you're as bored as a toddler in a Lie algebra lecture.

I can't wait till you angrily reply and tell me you're on £80K at ARM! The world we live in these days... "It's not who ya know, it's who ya blow." - Superhands

Bulldozer has no HT, period. Two full cores in one module. One module is not one core and two threads, it is two cores and two threads.

ATIorNvidia said:
Show me any programs that use more than 4 cores and don't benefit from HT, until then you don't really have a case.

*Whistles*

CmdrTobs · 21 Jan 2011 at 01:44

joeyjojo said:
I looked for some performance differences between cpus with virtualisation and without but didn't have any luck. I'm running a linux box with vmware on an e7200 atm and would like to know how it compares to say a e8xx.

Sorry, I dunno. Are you saying the e8xx does not have Vanderpool and the e7200 does? I thought Q/E6000s and higher all had Vanderpool? (a guess)

I dunno the exact performance hit for not having VM CPU support, but whatever it is, it is a feature that has to be emulated in software, so it's not just a missing-feature hit but a performance penalty.

I suspect the penalty will be enough to put me off a 'K' Sandy Bridge, and I never want to not OC, so that automatically excludes me from the non-K Sandy Bridge as well.

In my view Sandy Bridge is a possible 'fail' as an upgrade for me and types like me. All the "I love rendering" types on here are either lying or should cough up for a C32 dual-socket Opteron and put down these desktop 'toys'. The cost of such a system would be a drop in the ocean compared to what they probably blow on software licences.

Mario · 21 Jan 2011 at 08:02

CmdrTobs - you sound very AMD'ish actually, you probably can't see it, can you?
It's interesting, how did Anandtech get those Bulldozer CPUs for testing? Or is it just another AMD strategy, like the 69xx series and how those were supposed to wipe out the NV 5xx series?
SB fail? I just LOL'ed. Probably tests don't matter anymore. Would be interesting to see how Bulldozer is going to properly catch a few-years-old i7. So much talk from AMD and the types of you - you said it yourself

drunkenmaster · 21 Jan 2011 at 16:04

CmdrTobs said:
That for me has the problem that it suggests elements on a pipeline can do the same operation on 2 threads simultaneously. (A ticket booth & worker never does more than one vehicle at a time in real life and in metaphor).

For that ticket booth analogy to work you need at least two sets of ticket booths, A & B, in series, and then pose the case that while one car is doing A another car can use B, thus making sure A and B are always utilised. Though that analogy runs roughshod over the fact that operations are pipelined and executed 'out of order' anyway.

These sorts of imprecise metaphors leave people to overestimate the possible gains of Hyperthreading.

Actually, as I started reading I thought this will be bad, but he got it spot on. Yes, he didn't mention the ticket booth guys are highly skilled and there's a coin slot either side so two bikes can go through, but the metaphor is fine.

The problem with HT is, if you're only pushing cars through, you'll never see any improvement whatsoever (not quite true: the ability to hold more than one thread means if one stalls the other can keep going through, and I said stalls, which cars can do). If you're running applications that are all like motorbikes, you can get essentially a 100% speed boost.

The thing with HT is it's highly dependent on what you're doing. Core 2 architecture (and there's surprisingly little detail on Anandtech about high-level architecture in the Sandy Bridge review, so I'm not sure what's changed inside the integer cores) is essentially a 4-issue core. If you've got a thread that can use them all, the core can't jam through another thread, simple as that. If an instruction is using 3 of the execution units and the second thread also wants to use 3 execution units, no deal, etc, etc.

However, you're also limited to 2 threads, so if you have two threads both only needing one execution unit you're still wasting two others every clock. Sometimes you'll get no benefit, sometimes a large amount, sometimes very little; some programs consistently get a small or large benefit, and some vary. The one thing HT can't EVER do, and will never be able to do, is have two threads use the full width of the core at the same time. It will never exceed 4 execution units being filled with an instruction.

Where Bulldozer does well is, with only 5% die space, they jam another integer core in each module, which will never suffer from HT issues: it's always available.

There are downsides and upsides though. Bulldozer went with a 2-issue core from a 3-issue core in Phenom II, so one P2 core had 3 execution units available, each Bulldozer core only has 2, and a module with 2 cores can only handle 4 issues over two cores, while a dual-core Phenom can do 6.

But that's not all bad news. Hugely better/more aggressive prediction means those units are in use more often, a better pipeline means they are in use more often, AMD reckons the 3rd issue was rarely ever being used anyway, and with architectural improvements you can easily get more single-thread performance from a Bulldozer core than a P2 core.

The biggest bonus of AMD's second-core-in-the-module strategy is size, nothing more or less. 5% extra die space for an extra core in each module is insane. With a P2, to double the core count you'd be increasing die size by a heck of a lot more than that. Keep in mind they are talking about total chip die space, and cache/chipset takes up room; it's 12% extra transistors or so to turn a single-core module into a dual-core module. That's still much, much better than 100%, which is the norm in these situations.

That basically means AMD will be fighting a quad-core Sandy Bridge with an octo-core Bulldozer that's WAY smaller than an octo-core Sandy Bridge would be.

Anyway, because it's really a second core, rather than jamming a second thread through the same core, it's ALWAYS available and it will give consistent performance on any given thread/instructions being used. It won't work sometimes and not other times. They are giving us an estimate that the extra core adds around 80% performance over a single core, because some of the logic is shared, but not much.

80% extra performance all the time is better than -10 to +90% performance increase from HT, as its rarely anywhere near either extreme.

HT's very useful, real cores are better, HT on better cores is more useful than more worse cores, think quad core i7 vs hexcore p2.

There's no real right or wrong approach here.

Intel's current design is a very wide issue core, which lends itself better to HT as many threads can't fill the execution units. Bulldozer has narrowed the execution units which would be awful for HT as the core will rarely not be fully utilised, while the narrowed core also makes it smaller, and means adding the second core adds very little in size.
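The issue-width argument in this post can be turned into a toy model. This is only a rough sketch, not a description of any real core: each thread asks for some number of execution units per cycle (the demand lists are made up), the SMT case shares 4 slots between two threads with the second thread taking whatever is left over, and the module-style case gives each thread its own 2 slots. The names `smt_cycle`, `split_cycle` and `average_issue` are hypothetical.

```python
import random

def smt_cycle(demand_a, demand_b, width=4):
    """Shared core: thread A issues first, B gets whatever slots are left."""
    issued_a = min(demand_a, width)
    issued_b = min(demand_b, width - issued_a)
    return issued_a + issued_b

def split_cycle(demand_a, demand_b, width_each=2):
    """Two narrow cores: each thread is capped by its own core's width."""
    return min(demand_a, width_each) + min(demand_b, width_each)

def average_issue(cycle_fn, demands, seed=0, cycles=10_000):
    """Average slots filled per cycle when both threads draw from `demands`."""
    rng = random.Random(seed)
    total = 0
    for _ in range(cycles):
        total += cycle_fn(rng.choice(demands), rng.choice(demands))
    return total / cycles

# Assumed per-cycle demand mixes: "motorbikes" versus "cars".
for label, demands in [("light threads", [0, 1, 1, 2]),
                       ("heavy threads", [2, 3, 4, 4])]:
    print(label,
          "shared 4-wide:", average_issue(smt_cycle, demands),
          "two 2-wide:", average_issue(split_cycle, demands))
```

In this model both designs top out at 4 slots per cycle; the difference is that the shared core lets a single heavy thread use all 4, while the split design guarantees each thread its own 2.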

Liampope · 22 Jan 2011 at 00:36

Thanks for that - very informative. Can't wait for Bulldozer benches.

CmdrTobs · 22 Jan 2011 at 16:45

drunkenmaster said:
Actually, as I started reading I thought this will be bad, but he got it spot on. Yes, he didn't mention the ticket booth guys are highly skilled and there's a coin slot either side so two bikes can go through, but the metaphor is fine.

The problem with HT is, if you're only pushing cars through, you'll never see any improvement whatsoever (not quite true: the ability to hold more than one thread means if one stalls the other can keep going through, and I said stalls, which cars can do). If you're running applications that are all like motorbikes, you can get essentially a 100% speed boost.

I surrender on the metaphors. To get a clear picture I think you would need to describe a scenario as complex as the actual situation.

drunkenmaster said:
The thing with HT is it's highly dependent on what you're doing. Core 2 architecture (and there's surprisingly little detail on Anandtech about high-level architecture in the Sandy Bridge review, so I'm not sure what's changed inside the integer cores) is essentially a 4-issue core. If you've got a thread that can use them all, the core can't jam through another thread, simple as that. If an instruction is using 3 of the execution units and the second thread also wants to use 3 execution units, no deal, etc, etc.

However, you're also limited to 2 threads, so if you have two threads both only needing one execution unit you're still wasting two others every clock. Sometimes you'll get no benefit, sometimes a large amount, sometimes very little; some programs consistently get a small or large benefit, and some vary. The one thing HT can't EVER do, and will never be able to do, is have two threads use the full width of the core at the same time. It will never exceed 4 execution units being filled with an instruction.

I think I agree. What we see as 'performance' is the integral of these optimal and sub-optimal situations over time, expressed as a performance boost in applications. People are reporting gains of up to ~30% and losses of ~10%. The reality is Hyperthreading is an OS-level CPU feature and as such is dependent on the OS scheduler. So an OS should always avoid hyperthreading in any scenario that will cause a performance hit (which is causing much confusion with people on this thread). So when people say "zomg I have noticed no slowdown with 'hyperthreading on', Hyperthreading does nothing but speed me up, thus everything you say about hyperthreading is wrong" I sigh. I am going to take a stab in the dark here and guess that on Fedora (a Linux kernel with a good scheduler) you will be hard pressed to find a slow hyperthreading situation.
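To make the scheduler point concrete, here is a minimal sketch, assuming Linux and assuming you already know which logical CPUs are siblings on the same physical core (the `siblings` groups below are made-up example data, not read from any machine). It picks one logical CPU per core, so a process restricted to that set never has two of its threads sharing a core. `os.sched_setaffinity` is a real, Linux-only call in Python's standard library; `one_cpu_per_core` and `pin_to_physical_cores` are hypothetical helpers.

```python
import os

def one_cpu_per_core(sibling_groups):
    """Return the lowest-numbered logical CPU from each physical core."""
    return {min(group) for group in sibling_groups if group}

def pin_to_physical_cores(sibling_groups):
    """Restrict the current process to one logical CPU per core (Linux only)."""
    cpus = one_cpu_per_core(sibling_groups)
    if hasattr(os, "sched_setaffinity") and cpus:
        os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return cpus

# Example data (assumed): a quad core with HT where CPU n and n+4 share a core.
siblings = [{0, 4}, {1, 5}, {2, 6}, {3, 7}]
print(sorted(one_cpu_per_core(siblings)))
```

`pin_to_physical_cores` is deliberately not called here, since the example groups may not match the machine running the code.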

In summary, for me as an overclocker and performance Hawk on Windows, Intel SMT is not for me. Or more precisely not something worth paying for or trading against increased overclockability.

There is one thing I disagree with: "so if you have two threads both only needing one execution unit you're still wasting two others every clock. " You call it waste, I call it a saving in watts allowing me more head room to run the active parts faster.

drunkenmaster said:
Where Bulldozer does well is, with only 5% die space, they jam another integer core in each module, which will never suffer from HT issues: it's always available.

There are downsides and upsides though. Bulldozer went with a 2-issue core from a 3-issue core in Phenom II, so one P2 core had 3 execution units available, each Bulldozer core only has 2, and a module with 2 cores can only handle 4 issues over two cores, while a dual-core Phenom can do 6.

While I see you are referring to integer cores, this is causing great confusion and we are falling foul of marketing speak. Let's be clear: a 'Bulldozer module' is just one CPU core. It is not a 'dual core' 'module'. It has clustered multi-threading as far as integer operations are concerned, but it has 1 fetch & decode stage. That makes it a single core with CMT (clustered multi-threading).

If you took some sort of extreme UV laser to a Bulldozer CPU, the smallest bit of micro-circuitry (probably should say nano-circuitry these days) you could isolate that could theoretically be rewired and recognised by Windows as a 586 CPU without changing the microcode would be: the parallel integer centres, the FPU centre and the fetch & decode circuitry (plus L1). That is what a 'core' is. This core is capable of CMT, but it's not a dual-core device in the same respect as, say, a Core 2 Duo. It's just one CPU core with the feature of CMT. I won't allow companies to get away with talking up CMT or even SMT into being extra cores in the sense that the Core 2 Duo had 2 cores. It's not being pedantic either; this distinction matters to those of us who care about programming.

drunkenmaster said:
But that's not all bad news. Hugely better/more aggressive prediction means those units are in use more often, a better pipeline means they are in use more often, AMD reckons the 3rd issue was rarely ever being used anyway, and with architectural improvements you can easily get more single-thread performance from a Bulldozer core than a P2 core.

The biggest bonus of AMD's second-core-in-the-module strategy is size, nothing more or less. 5% extra die space for an extra core in each module is insane. With a P2, to double the core count you'd be increasing die size by a heck of a lot more than that. Keep in mind they are talking about total chip die space, and cache/chipset takes up room; it's 12% extra transistors or so to turn a single-core module into a dual-core module. That's still much, much better than 100%, which is the norm in these situations.

I agree with that; I too am slightly excited about Bulldozer. But you know my stance on talking up clustered multi-threading capability into being '2 cores' in the Conroe style.

drunkenmaster said:
That basically means AMD will be fighting a quad-core Sandy Bridge with an octo-core Bulldozer that's WAY smaller than an octo-core Sandy Bridge would be.

Again, I agree with what you mean. I can decode it.

When the confused masses or review sites try to shove 8 hand-optimised assembler, AVX/FPU-heavy threads into a Bulldozer CPU that masquerades as an octo core when really it's a quad core with CMT, and it thus ends up benching a LOT slower than an octo-core 'Gulftown'-type CPU, people will jump to the wrong conclusions. Woe betide AMD if they carry through the marketing half-truths to the CPU box.

drunkenmaster said:
Anyway, because it's really a second core, rather than jamming a second thread through the same core, it's ALWAYS available and it will give consistent performance on any given thread/instructions being used. It won't work sometimes and not other times. They are giving us an estimate that the extra core adds around 80% performance over a single core, because some of the logic is shared, but not much.

I agree and you have laid it out nicely. Though let's not get carried away. Sure, the bulk of instructions are:

mov reg, reg
mov reg .....
push ...
...
And so on.... requiring only the ALU to work out (in fact bottlenecked by mem/cache speeds, but let's not go there), but the things that are fun to do and will get pressed in benches are playing games & encoding porn.... all of which use a ton of FPU, which that is not. I agree it will be a nice boost, particularly for the gazillion system-level processes and thus threads. But I would hate to go back to the pre-Athlon days of the K5/K6 vs Pentium 2 architecture performance paradigm.

drunkenmaster said:
80% extra performance all the time is better than -10 to +90% performance increase from HT, as its rarely anywhere near either extreme.

HT's very useful, real cores are better, HT on better cores is more useful than more worse cores, think quad core i7 vs hexcore p2.

There's no real right or wrong approach here.

Intel's current design is a very wide issue core, which lends itself better to HT as many threads can't fill the execution units. Bulldozer has narrowed the execution units which would be awful for HT as the core will rarely not be fully utilised, while the narrowed core also makes it smaller, and means adding the second core adds very little in size.

The right approach for me is to prioritise giving me more real cores and more pipelines with faster operations. I will keep the parts of my CPU that are idle.... well, idle, thank you very much. Then I can get down to some OCing.

Though I realise that ofc this is short term as we descend towards the limits of the ohmic junctions. Maybe the 386 was the way to go? Maybe FPU is for CUDA, OpenCL or something. Who the hell knows, not me; I don't care enough to think about it and speculate. What I do know is right now if Intel or AMD want my recession £££ they have to do the above or GTFO (Intel know this too, that's why they have realised specific and technically unnecessary gimps to the i5 & i3 vs the i7)

CmdrTobs · 22 Jan 2011 at 17:31

Mario said:
CmdrTobs - you sound very AMD'ish actually, you probably can't see it, can you?
It's interesting, how did Anandtech get those Bulldozer CPUs for testing? Or is it just another AMD strategy, like the 69xx series and how those were supposed to wipe out the NV 5xx series?
SB fail? I just LOL'ed. Probably tests don't matter anymore. Would be interesting to see how Bulldozer is going to properly catch a few-years-old i7. So much talk from AMD and the types of you - you said it yourself

AMD'ish? I don't own shares in either company. To boil down an argument that is 99% justifications and theory to 'you just don't like Intel' is, well... ignorant.

NathanE · 22 Jan 2011 at 23:20

Two factors cause HT to improve performance.

1. Thread context switching as a result of time slicing the CPU is expensive. And whilst it occurs no "real" work is being done. Therefore if the CPU exposes 2 virtual cores it allows the kernel to schedule two threads to the CPU. Even if the CPU can't actually handle both truly concurrently the kernel at least has "enqueued" the next thread thus removing a critical performance bottleneck.

2. The CPU out-of-order pipelines are not always full. An advanced CPU design with HT can inject additional work into unused pipelines thus allowing for higher utilisation of the core between the two threads sharing it.

There are no other reasons.

Hope this helps!

CmdrTobs · 22 Jan 2011 at 23:54

NathanE said:
Two factors cause HT to improve performance.

1. Thread context switching as a result of time slicing the CPU is expensive. And whilst it occurs no "real" work is being done. Therefore if the CPU exposes 2 virtual cores it allows the kernel to schedule two threads to the CPU. Even if the CPU can't actually handle both truly concurrently the kernel at least has "enqueued" the next thread thus removing a critical performance bottleneck.

I thought 2+ core CPUs ended that bottleneck for intensive threads, as you can assign the hundreds of semi-idle threads to one core (provided the Windows scheduler works well - a whole other debate).

Thanks for that, that's given me food for thought on an OS project I was thinking about.

NathanE said:
2. The CPU out-of-order pipelines are not always full. An advanced CPU design with HT can inject additional work into unused pipelines thus allowing for higher utilisation of the core between the two threads sharing it.

There are no other reasons.

Hope this helps!

Yes, I agree, but will add that this does not automatically mean improved performance. Most applications don't spontaneously create threads during execution, nor can they progress if a main/control/sync thread is stalled, so it's misleading to characterise this increased utilisation of elements as a flat-out performance boost. AMD should be giving us more of a flat-out performance boost with Bulldozer and CMT.
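One way to put numbers on that caveat is Amdahl's law. Nobody in this thread uses it by name; it is brought in here purely as an illustration. It says that if only a fraction of a program's work can run in parallel, extra hardware threads only speed up that fraction. The sketch below treats a second hardware thread as worth an extra 0.3 of a core, which is an assumed figure loosely based on the ~30% best-case gains mentioned earlier, and `amdahl_speedup` is a hypothetical helper name.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Overall speedup when only `parallel_fraction` of the work scales."""
    if not 0.0 <= parallel_fraction <= 1.0 or workers <= 0:
        raise ValueError("need 0 <= parallel_fraction <= 1 and workers > 0")
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / workers)

# Assumption: one core plus HT behaves like 1.3 workers at best.
for p in (0.2, 0.5, 0.9):
    print(f"parallel fraction {p}: speedup {amdahl_speedup(p, 1.3):.3f}")
```

The smaller the parallel fraction, the less of the hardware's extra throughput shows up as application speedup, which is the point about a stalled main thread.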

P.S. This sort of thinking regarding system resources has, imo, led Microsoft to create cache-heavy OSes under the guise of 'better utilisation of free memory', when the end result is an OS that needs 2GB of RAM to feel like XP did with 1/4 of that, and still thrashes the HDD to death. /end off-topic rant.

jakspyder · 23 Jan 2011 at 00:30

I has sore brain.

Seriously though a very informative and interesting read. I was worried for a moment that this discussion would quickly turn into flaming and bashing but after almost falling off that cliff you guys pulled it back with some great posts and positive information from both sides. Thumbs up for good foruming

AceTK · 23 Jan 2011 at 00:50

ATIorNvidia said:
Despite all of your rambling, you've just told us right there that HT is in fact useful.

The gains you get from HT are just fine and dandy so I don't know where you are getting all of this nonsense from. Even with adding 2 real cores onto a quad core (making it a 6-core) you'll never get a full 25% benefit from each core or 50% benefit from both additional cores... it's going to be around the 30-40% ballpark. Considering HT is capable of yielding a 20-30% increase.... I'd again say that's just fine and dandy!

If you compare an i3 to an E8400 you'll see the difference that HT makes. You're right in saying HT is inefficient, but in layman's terms, it's essentially a form of out-of-order execution, which is a good thing however you see it. The only problem is the premium Intel charge for it. Our day-to-day apps for the most part are not programmed to perfection, and that isn't going to change when apps/programs start using 4-8 cores/threads more regularly.... therefore HT will remain useful.

Bulldozer will have no HT. It's 8 cores and 8 threads period.

Just want to point out that Bulldozer is not a true 8 cores; it's 4 modules. A Bulldozer module has the following: 2 integer cores, one 256-bit shared FPU (that can be addressed as a single 256-bit unit or 2 128-bit units per cycle), a shared front end, and a shared L2 cache.

Each Bulldozer module is seen by the OS as 2 cores. The OS does not see a module, only cores.
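For anyone who wants to see what their own OS reports, here is a minimal sketch, assuming Linux: the kernel exposes, per logical CPU, a list of the logical CPUs it shares a core with in the sysfs file `thread_siblings_list`. How any particular kernel chooses to report a Bulldozer module is not something this sketch assumes. `parse_cpu_list` and `siblings_of` are hypothetical helper names.

```python
from pathlib import Path

def parse_cpu_list(text):
    """Parse a Linux cpulist string such as '0,4' or '0-1,8-9' into a set."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

def siblings_of(cpu=0):
    """Logical CPUs sharing a core with `cpu`, or None if sysfs is unavailable."""
    path = Path(f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list")
    try:
        return parse_cpu_list(path.read_text())
    except (OSError, ValueError):
        return None

print(siblings_of(0))
```

On a machine without that sysfs file (for example Windows), `siblings_of` simply returns `None`.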

Interlagos has 8 Bulldozer modules for a total of 16 cores.

Valencia has 4 Bulldozer modules for a total of 8 cores.

From what I've heard, a module will perform better than one core + hyperthreading but less than 2 cores.

Just to throw another hand grenade into this battlefield of debate lol

DragonQ · 23 Jan 2011 at 01:12

A module is ~80% of 2 cores in terms of performance apparently, which is more than the ~65% of 1 core + HT (varies massively depending on application though of course).
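The arithmetic behind that comparison, as a quick sketch. It takes the wording literally and assumes both percentages are measured against two full cores; that reading is an assumption, and the figures themselves are the rough estimates quoted in this thread, not measurements. `core_equivalents` is a hypothetical helper name.

```python
def core_equivalents(fraction_of_two_cores):
    """Convert 'X% of two cores' into a number of single-core equivalents."""
    return 2 * fraction_of_two_cores

module = core_equivalents(0.80)   # one Bulldozer module, per the estimate above
ht_core = core_equivalents(0.65)  # one core + HT, per the estimate above

print(f"module: {module:.2f} cores, core + HT: {ht_core:.2f} cores")
print(f"gain from the second thread: module +{module - 1:.0%}, HT +{ht_core - 1:.0%}")
```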

pingu666 · 23 Jan 2011 at 01:36

AMD are naming chips after F1 tracks now? Kinda sad I can't get a low-end Bulldozer now, as the Valencia street track is poo :/ don't want to be reminded of that

Interlagos is great however

CmdrTobs · 23 Jan 2011 at 01:45

AceTK said:
Just want to point out that Bulldozer is not a true 8 cores; it's 4 modules. A Bulldozer module.....

Which is what I was trying to explain to him till I gave up, though I prefer to call the Bulldozer module a 'core' and then call the integer 'core' an integer centre to disambiguate, or just colloquially refer to it as an old-school ALU. Keeps the language non-manufacturer-specific.

DragonQ said:
A module is ~80% of 2 cores in terms of performance apparently, which is more than the ~65% of 1 core + HT (varies massively depending on application though of course).

I agree, once my silly brain worked out 65% is only a bit over 1 core. But it's hard to compare SMT (HT) to CMT (Bulldozer) like this, as Bulldozer can really execute the threads 'simultaneously' if they're integer instructions.

AceTK · 23 Jan 2011 at 01:50

CmdrTobs said:
Which is what I was trying to explain to him till I gave up, though I prefer to call the Bulldozer module a 'core' and then call the integer 'core' an integer centre to disambiguate, or just colloquially refer to it as an old-school ALU. Keeps the language non-manufacturer-specific.

I agree, once my silly brain worked out 65% is only a bit over 1 core. But it's hard to compare SMT (HT) to CMT (Bulldozer) like this, as Bulldozer can really execute the threads 'simultaneously' if they're integer instructions.

Agreed, I'll try to refrain from using manufacturer-specific terms, as it can get confusing when discussing the core architecture.

ATIorNvidia · 23 Jan 2011 at 02:43

CmdrTobs said:
Which is what I was trying to explain to him till I gave up, though I prefer to call the Bulldozer module a 'core' and then call the integer 'core' an integer centre to disambiguate, or just colloquially refer to it as an old-school ALU. Keeps the language non-manufacturer-specific.

Don't assume that you need to explain that to me, I already know. I was more or less making a point that Bulldozer has no form of HT. It has two individual integer cores (incomplete or not, you can still define them as that) with two individual threads. I probably shouldn't have said "full cores".

You can call it AMD's answer/solution to HT, but you cannot call it a form of HT. Not even close. It's completely different.