Again, thanks for more insight.

I see in your profile that you are into computer science/electrical engineering, so this is the stuff I presume you studied during your education, whereas my background is in mechanical engineering. So this is all new stuff to me, although it has been an interesting read.

Yeah, I did my first few degrees in Comp Sci and EEE and worked in that area for a while. I loved microelectronics; for a while I wanted to do research on future molecular electronics using graphene. I was sure that was the stopgap future before quantum circuits became feasible... but that was many years ago. I find I am more interested in pure maths, so lately I've been distancing myself from engineering.
Mechanical is interesting.

I've had a few brushes with it myself. Last term my supervisor in the engineering department told the admissions tutor that I'd be switching to the maths department to continue my research in nonlinear equations. The admissions guy (funny fella) suggested I stay and come work with the CFD guys on the Navier-Stokes equations instead... That would be your area. I didn't take him up on it, though.
At any rate, you should be able to pick most of this up easily if you took a book on x86 assembler and went at it. None of this new-fangled MASM32/HLA crap. If you are interested you should go back and start with 16-bit low-level assembler, though. It can give you an enormous insight that no amount of reading will. After assembler you could try writing COM files in pure binary. It can be immensely liberating and insightful: suddenly everything mysterious about the lowest level of hardware, software, the OS, executable file formats, firmware and all that jazz makes sense in that one moment.
I used to hold a record for writing the shortest x86 program that actually does something. It was 2 bytes long, written in pure binary.

I daresay that is a tough one to beat.
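Just to show how little a COM file actually needs (I won't claim these were the record bytes, it's just an illustration), here is a rough C sketch that writes a 2-byte COM file by hand. DOS loads a COM file raw at offset 100h and jumps straight to the first byte; the two bytes CD 20 encode INT 20h, which simply terminates the program, so the output is a valid if boring executable.

#include <stdio.h>

int main(void)
{
    /* A .COM file has no header: DOS copies the raw bytes to offset 100h
       and starts executing at the first one. */
    const unsigned char code[] = { 0xCD, 0x20 };   /* INT 20h = terminate */
    FILE *f = fopen("tiny.com", "wb");
    if (!f) return 1;
    fwrite(code, 1, sizeof code, f);
    fclose(f);
    return 0;
}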
I agree that in real-world usage FLOPS values will be lower than the theoretical maximum at a given CPU speed, as lots of CPU cycles are also being used to run Windows, background processes and device drivers, and to handle cached data.
Yeah. But a better question might be, "If there were only one task (process/thread) running, would all my programs that use the FPU hit this theoretical limit?" The answer is an emphatic no.
The more crucial reason there is a difference between theoretical and actual performance has to do with the program itself: an algorithm needs to be 100% optimised to reach the theoretical maximum. So even in a single-tasking environment with one process you would be hard-pressed to hit that theoretical limit.
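To make that concrete, here is a minimal C sketch (the array size and values are just made up for illustration). Every addition in the loop depends on the previous one, so the FPU spends its time waiting on the latency of the last add instead of starting a new operation every cycle; the peak figure assumes a steady stream of independent multiply-adds.

#include <stdio.h>

int main(void)
{
    double x[1000], sum = 0.0;
    for (int i = 0; i < 1000; ++i)
        x[i] = i * 0.001;

    /* Loop-carried dependency: iteration i+1 cannot start its add
       until iteration i has finished, so throughput is limited by
       add latency, nowhere near the theoretical FLOPS peak. */
    for (int i = 0; i < 1000; ++i)
        sum += x[i];

    printf("%f\n", sum);
    return 0;
}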
I have been reading that the floating-point division operation is more complicated than multiplication and takes more cycles than its multiplication counterpart.
Yeah, you're quite right. IEEE 754 floating-point multiplication is rather straightforward: you simply integer-multiply the mantissas and add the exponents (then renormalise and round). Division circuitry is somewhat more complicated and there are several methods; I think even Horner's method for dividing a polynomial by a linear binomial is used in modern CPUs.
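A toy C sketch of that decomposition (ignoring rounding, subnormals, infinities and NaNs, and obviously not how the silicon does it): split each operand into mantissa and exponent, multiply the mantissas, add the exponents, and put the result back together.

#include <stdio.h>
#include <math.h>

/* a = ma * 2^ea and b = mb * 2^eb, so a*b = (ma*mb) * 2^(ea+eb) */
double toy_fmul(double a, double b)
{
    int ea, eb;
    double ma = frexp(a, &ea);       /* mantissa in [0.5, 1), exponent out */
    double mb = frexp(b, &eb);
    return ldexp(ma * mb, ea + eb);  /* multiply mantissas, add exponents  */
}

int main(void)
{
    printf("%g vs %g\n", toy_fmul(3.5, -2.25), 3.5 * -2.25);
    return 0;
}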
However, this does not necessarily mean it has to be slower (at least not much slower), because it can be implemented efficiently in circuit form to get similar performance. I know the two are equal speed on some processors, but I am not sure whether division is slower or faster on the x86 implementation. I would imagine the performance should be fairly close, while other functions like trigonometry, log, etc. can be a lot slower than simple FDIVs and FMULs...
Then again, I suppose one could argue that even trig functions can be implemented in a single cycle in hardware. I knew a guy who spent a ****load on an FPGA implementation of a specialised processor because his project needed a sine in 1 clock.
I guess that's why multiply-add is preferred to measure CPU FLOPS performance. I also agree that FLOPS depends on architecture as well: a CPU with a better architecture will be able to process floating-point operations faster.
Again, when measuring FLOPS they will try to pick an instruction that is representative of the average op rather than the worst-case or best-case op. So it will be an instruction that is fairly simple, but no simpler; it certainly won't be one of the extremely complicated ones. FMA is a reasonable middle ground. The FPU is not optimised towards a single instruction either, because a typical program consists of a variety of them.
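For what it's worth, C99 exposes the fused multiply-add directly as fma() in math.h, which is basically the unit of currency these benchmarks count: one multiply and one add, i.e. 2 FLOPs, with a single rounding at the end. A trivial sketch:

#include <stdio.h>
#include <math.h>

int main(void)
{
    double a = 1.5, b = 2.0, c = 0.25;
    /* a*b + c computed as one fused operation, counted as 2 FLOPs */
    printf("%f\n", fma(a, b, c));   /* prints 3.250000 */
    return 0;
}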
As SSE includes support for floating-point ops and has 128-bit wide registers, it can store two 64-bit (double-precision) numbers.
Those would be two standard doubles. The x87 FPU itself natively works with 80-bit extended doubles internally, and its registers are organised as a stack eight deep; SSE added 128-bit XMM registers on top of that (eight of them originally, if I recall correctly), as a flat register file rather than a stack.
Stack organisation is a very efficient way of designing microprocessors for arithmetic, because it is trivial to convert an algebraic expression to reverse Polish notation (RPN) and then evaluate the expression on the stack. I always wondered why the x86 ALU wasn't stack-organised even though the FPU was. Perhaps it has to do with how the general registers are often used for operations other than arithmetic: a lot of standard assembly code consists of a series of MOVs and INT 21h (DOS services), INT 16h (BIOS keyboard), INT 33h (mouse), INT 10h (video BIOS), etc., with the odd XOR, NOT, ADD or SUB now and then, and often CMP, JMP, JNE, JNB and so on, so I suppose a flat register file might be better there.
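As a sketch of why stack machines and RPN go together so nicely, here is a tiny hard-coded C evaluation of (2 + 3) * 4, which in RPN is 2 3 + 4 *: push operands, and every operator just pops two values and pushes the result, much the way the x87 stack works with FADD/FMUL on ST(0) and ST(1).

#include <stdio.h>

static double stack[16];
static int top = 0;

static void   push(double v) { stack[top++] = v; }
static double pop(void)      { return stack[--top]; }

int main(void)
{
    /* (2 + 3) * 4  ==  RPN: 2 3 + 4 * */
    push(2.0);
    push(3.0);
    push(pop() + pop());     /* '+' pops two, pushes one */
    push(4.0);
    push(pop() * pop());     /* '*' pops two, pushes one */
    printf("%g\n", pop());   /* prints 20 */
    return 0;
}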
In double precision it is 4 FLOPS per core per cycle (2 multiplies + 2 adds).
For example, for my quad-core CPU @ 3.4GHz:
The theoretical maximum will be: no. of cores x FLOPS per core per cycle x CPU speed (total cycles/sec) = 4 cores x 4 FLOPS per core per cycle x 3.4GHz = 54.4 GFLOPS
In Intel Burn Test I always manage to get 45 GFLOPS, which is about 83% of the theoretical maximum, which I think is very good.
Here is the Intel GFLOPS sheet for its processors at stock speed. If you carry out the calculation mentioned above, you will get the same GFLOPS numbers as stated in the sheet. What do you think? Are modern processors' FPUs mainly designed for multiply-add operations?
http://www.intel.com/support/processors/sb/cs-023143.htm#3
Also, as Intel Burn Test makes use of the Gaussian elimination method to solve systems of simultaneous linear algebraic equations, it involves subtraction and division if I remember correctly. If the FPU is mainly designed for multiply-add, then how are the division and subtraction implemented in the LINPACK algorithm?
Good work on the calc and a very useful link you got there.
Actually the FPU isn't designed just for FMA/FMADD. There are quite a large number of FPU instructions. Even the original 8087 (the FPU that came as a separate chip alongside the 8086 microprocessor) had quite a few, including division and subtraction:
e.g. division instructions found in the 8087 include FDIV, FDIVP, FDIVR, FDIVRP and FIDIV.
On top of these, SSE adds division instructions like DIVSS/DIVPS for single-precision SIMD ops. Similarly there are subtract operations, both float and integer. (Note that subtraction often uses the same addition circuitry but uses something called two's complement to do subtraction via addition. It is basically what we do: subtraction is addition with one operand negated.)
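A quick C sketch of that trick: negate one operand by flipping its bits and adding one (that is all two's complement is), then feed both to the same adder.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t a = 1000, b = 42;

    /* Two's complement: -b == ~b + 1, so a - b == a + ~b + 1.
       The hardware reuses the adder and just inverts b's bits
       while feeding in a carry of 1. */
    uint32_t diff = a + ~b + 1u;

    printf("%u %u\n", diff, a - b);   /* prints 958 958 */
    return 0;
}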
As for Gaussian elimination, matrix multiply operations are typically used in practice. Naive algorithms for matrix multiply have what is called (by pure mathematicians and theoretical computer scientists) an asymptotic complexity of O(n^3). The obvious lower bound is O(n^2), since you have to touch every entry at least once, and it is widely conjectured that the exponent can be pushed arbitrarily close to 2... Though nobody knows how to do it that quickly, algorithms faster than the standard n x n (square) matrix multiply are known:
e.g. Strassen's algorithm - f(n) ∈ O(n^2.807) (the 2x2 step is sketched below)
and the Coppersmith-Winograd algorithm - f(n) ∈ O(n^2.376)
The Coppersmith-Winograd algorithm is the fastest known matrix multiply algorithm. (The conjecture is that something arbitrarily close to O(n^2) is possible, but nobody has figured it out yet. If you can figure out what it is, you should inbox it to me. That way I can take the credit... er, I mean, so I can check if it's right.)
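As promised, here is the 2x2 step at the heart of Strassen's method, sketched in C with throwaway example values: seven multiplications instead of the usual eight (trading them for extra additions, which are cheap by comparison), and applying this recursively to blocks is what gives the ~n^2.807 bound.

#include <stdio.h>

int main(void)
{
    /* Two arbitrary 2x2 matrices A and B, just for demonstration */
    double a11 = 1, a12 = 2, a21 = 3, a22 = 4;
    double b11 = 5, b12 = 6, b21 = 7, b22 = 8;

    /* Strassen's seven products */
    double m1 = (a11 + a22) * (b11 + b22);
    double m2 = (a21 + a22) * b11;
    double m3 = a11 * (b12 - b22);
    double m4 = a22 * (b21 - b11);
    double m5 = (a11 + a12) * b22;
    double m6 = (a21 - a11) * (b11 + b12);
    double m7 = (a12 - a22) * (b21 + b22);

    /* Recombine into C = A*B using only additions and subtractions */
    double c11 = m1 + m4 - m5 + m7;
    double c12 = m3 + m5;
    double c21 = m2 + m4;
    double c22 = m1 - m2 + m3 + m6;

    printf("%g %g\n%g %g\n", c11, c12, c21, c22);   /* 19 22 / 43 50 */
    return 0;
}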
(While on this topic, here is a very interesting talk by Scott Aaronson at a TEDx at Caltech in honour of Feynman... He uses the Coppersmith-Winograd algorithm as an example in explaining complexity. A must see!)
http://www.youtube.com/watch?v=SczraSQE3MY
(Btw how do you make a link into an embedded youtube vid on this forum?)
However, I have no clue whether LINPACK uses any of these methods. Even in the worst case, using the standard algorithm, it can be implemented efficiently on the FPU using just the standard operations.
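And for completeness, the standard algorithm the FPU handles so comfortably is just this triple loop (a sketch with small hard-coded sizes): n^3 multiply-adds and nothing more exotic, which is exactly the kind of workload the FMA-heavy peak figures are built around.

#include <stdio.h>

#define N 3

int main(void)
{
    double a[N][N] = {{1,2,3},{4,5,6},{7,8,9}};
    double b[N][N] = {{9,8,7},{6,5,4},{3,2,1}};
    double c[N][N] = {{0}};

    /* Standard O(n^3) matrix multiply: every iteration of the inner
       loop is one multiply-add, so N*N*N = 27 of them in total here. */
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < N; ++k)
                c[i][j] += a[i][k] * b[k][j];

    for (int i = 0; i < N; ++i)
        printf("%g %g %g\n", c[i][0], c[i][1], c[i][2]);
    return 0;
}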