Again, thanks for more insight.

I see in your profile that you are into computer science/electrical engineering, so this is the stuff I presume you studied during your education, whereas my background is in mechanical engineering. So this is all new stuff to me, although it has been an interesting read.

Yeah, I did my first few degrees in Comp Sci and EEE and worked in that area for a while. I loved microelectronics; for a while I wanted to do research on future molecular electronics using graphene. I was sure that was the stopgap future before quantum circuits became feasible... but that was many years ago. I find I am more interested in pure maths, so lately I've been distancing myself from engineering.
Mechanical is interesting.

I've had a few brushes with it myself. Last term my supervisor in the engineering department told the admissions tutor that I'd be switching to the maths department to continue my research in nonlinear equations. The admissions guy (funny fella) suggested I stay and come work with the CFD guys on the Navier-Stokes equations instead... That would be your area. I didn't take him up on it, though.
At any rate, you should be able to pick most of this up easily if you took a book on x86 assembler and went at it. None of this new-fangled MASM32/HLA crap. If you are interested you should go back and start with 16-bit low-level assembler, though. It can give you an enormous insight that no amount of reading will. After assembler you could try writing COM files in pure binary. It can be immensely liberating and insightful: suddenly everything mysterious about the lowest level of hardware, software, the OS, executable file formats, firmware and all that jazz makes sense in that one moment.
I used to hold a record for writing the shortest x86 program that actually does something. It was 2 bytes long, written in pure binary.

I daresay that is a tough one to beat.
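Just to show how little a COM file actually needs (I won't claim these were the record bytes, it's just an illustration), here is a rough C sketch that writes a 2-byte COM file by hand. DOS loads a COM file raw at offset 100h and jumps straight to the first byte; the two bytes CD 20 encode INT 20h, which simply terminates the program, so the output is a valid if boring executable.

#include <stdio.h>

int main(void)
{
    /* A .COM file has no header: DOS copies the raw bytes to offset 100h
       and starts executing at the first one. */
    const unsigned char code[] = { 0xCD, 0x20 };   /* INT 20h = terminate */
    FILE *f = fopen("tiny.com", "wb");
    if (!f) return 1;
    fwrite(code, 1, sizeof code, f);
    fclose(f);
    return 0;
}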
I agree that in real-world usage FLOPS values will be lower than the theoretical maximum at a given CPU speed, as lots of CPU cycles are also being used to run Windows, background processes and device drivers, and to handle cached data.
Yeah. But a better question might be, "If there were only one task (process/thread) running, would all my programs that use the FPU hit this theoretical limit?" The answer is an emphatic no.
The more crucial reason there is a difference between theoretical and actual performance has to do with the program itself: an algorithm needs to be 100% optimised to reach the theoretical maximum. So even in a single-tasking environment with one process you would be hard-pressed to hit that theoretical limit.
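To make that concrete, here is a minimal C sketch (the array size and values are just made up for illustration). Every addition in the loop depends on the previous one, so the FPU spends its time waiting on the latency of the last add instead of starting a new operation every cycle; the peak figure assumes a steady stream of independent multiply-adds.

#include <stdio.h>

int main(void)
{
    double x[1000], sum = 0.0;
    for (int i = 0; i < 1000; ++i)
        x[i] = i * 0.001;

    /* Loop-carried dependency: iteration i+1 cannot start its add
       until iteration i has finished, so throughput is limited by
       add latency, nowhere near the theoretical FLOPS peak. */
    for (int i = 0; i < 1000; ++i)
        sum += x[i];

    printf("%f\n", sum);
    return 0;
}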
I have been reading that the floating-point division operation is more complicated than multiplication and takes more cycles than its multiplication counterpart.
Yeah, you're quite right. IEEE 754 floating-point multiplication is rather straightforward: you simply integer-multiply the mantissas and add the exponents (then renormalise and round). Division circuitry is somewhat more complicated and there are several methods; I think even Horner's method for dividing a polynomial by a linear binomial is used in modern CPUs.
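A toy C sketch of that decomposition (ignoring rounding, subnormals, infinities and NaNs, and obviously not how the silicon does it): split each operand into mantissa and exponent, multiply the mantissas, add the exponents, and put the result back together.

#include <stdio.h>
#include <math.h>

/* a = ma * 2^ea and b = mb * 2^eb, so a*b = (ma*mb) * 2^(ea+eb) */
double toy_fmul(double a, double b)
{
    int ea, eb;
    double ma = frexp(a, &ea);       /* mantissa in [0.5, 1), exponent out */
    double mb = frexp(b, &eb);
    return ldexp(ma * mb, ea + eb);  /* multiply mantissas, add exponents  */
}

int main(void)
{
    printf("%g vs %g\n", toy_fmul(3.5, -2.25), 3.5 * -2.25);
    return 0;
}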
However, this does not necessarily mean it has to be slower (at least not much slower), because it can be implemented efficiently in circuit form to get similar performance. I know the two are equal speed on some processors, but I am not sure whether division is slower or faster on the x86 implementation. I would imagine the performance should be fairly close, while other functions like trigonometry, log, etc. can be a lot slower than simple FDIVs and FMULs...
Then again, I suppose one could argue that even trig functions can be implemented in a single cycle in hardware. I knew a guy who spent a ****load on an FPGA implementation of a specialised processor because his project needed a sine in 1 clock.
I guess that's why multiply-add is preferred to measure CPU FLOPS performance. I also agree that FLOPS depends on architecture as well: a CPU with a better architecture will be able to process floating-point operations faster.
Again, when measuring FLOPS they will try to pick an instruction that is representative of the average op rather than the worst-case or best-case op. So it will be an instruction that is fairly simple, but no simpler; it certainly won't be one of the extremely complicated ones. FMA is a reasonable middle ground. The FPU is not optimised towards a single instruction either, because a typical program consists of a variety of them.
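For what it's worth, C99 exposes the fused multiply-add directly as fma() in math.h, which is basically the unit of currency these benchmarks count: one multiply and one add, i.e. 2 FLOPs, with a single rounding at the end. A trivial sketch:

#include <stdio.h>
#include <math.h>

int main(void)
{
    double a = 1.5, b = 2.0, c = 0.25;
    /* a*b + c computed as one fused operation, counted as 2 FLOPs */
    printf("%f\n", fma(a, b, c));   /* prints 3.250000 */
    return 0;
}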
As SSE includes support for floating-point ops and has 128-bit wide registers, it can store two 64-bit (double-precision) numbers.
Those would be two standard doubles. The x87 FPU itself natively works with 80-bit extended doubles internally, and its registers are organised as a stack eight deep; SSE added 128-bit XMM registers on top of that (eight of them originally, if I recall correctly), as a flat register file rather than a stack.
Stack organisation is a very efficient way of designing microprocessors for arithmetic, because it is trivial to convert an algebraic expression to reverse Polish notation (RPN) and then evaluate the expression on the stack. I always wondered why the x86 ALU wasn't stack-organised even though the FPU was. Perhaps it has to do with how the general registers are often used for operations other than arithmetic: a lot of standard assembly code consists of a series of MOVs and INT 21h (DOS services), INT 16h (BIOS keyboard), INT 33h (mouse), INT 10h (video BIOS), etc., with the odd XOR, NOT, ADD or SUB now and then, and often CMP, JMP, JNE, JNB and so on, so I suppose a flat register file might be better there.
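As a sketch of why stack machines and RPN go together so nicely, here is a tiny hard-coded C evaluation of (2 + 3) * 4, which in RPN is 2 3 + 4 *: push operands, and every operator just pops two values and pushes the result, much the way the x87 stack works with FADD/FMUL on ST(0) and ST(1).

#include <stdio.h>

static double stack[16];
static int top = 0;

static void   push(double v) { stack[top++] = v; }
static double pop(void)      { return stack[--top]; }

int main(void)
{
    /* (2 + 3) * 4  ==  RPN: 2 3 + 4 * */
    push(2.0);
    push(3.0);
    push(pop() + pop());     /* '+' pops two, pushes one */
    push(4.0);
    push(pop() * pop());     /* '*' pops two, pushes one */
    printf("%g\n", pop());   /* prints 20 */
    return 0;
}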
In double precision it is 4 FLOPS per core per cycle (2 multiplies + 2 adds).
For example, for my quad-core CPU @ 3.4GHz:
The theoretical maximum will be: no. of cores x FLOPS per core per cycle x CPU speed (total cycles/sec) = 4 cores x 4 FLOPS per core per cycle x 3.4GHz = 54.4 GFLOPS
In Intel Burn Test I always manage to get 45 GFLOPS, which is about 83% of the theoretical maximum, which I think is very good.
Here is the Intel GFLOPS sheet for its processors at stock speed. If you carry out the calculation mentioned above, you will get the same GFLOPS numbers as stated in the sheet. What do you think? Are modern processors' FPUs mainly designed for multiply-add operations?
http://www.intel.com/support/processors/sb/cs-023143.htm#3
Also, as Intel Burn Test makes use of the Gaussian elimination method to solve systems of simultaneous linear algebraic equations, it involves subtraction and division if I remember correctly. If the FPU is mainly designed for multiply-add, then how are the division and subtraction implemented in the LINPACK algorithm?
Good work on the calc and a very useful link you got there.
Actually the FPU isn't designed just for FMA/FMADD. There are quite a large number of FPU instructions. Even the original 8087 (the FPU that came as a separate chip alongside the 8086 microprocessor) had quite a few, including division and subtraction:
e.g. division instructions found in the 8087 include FDIV, FDIVP, FDIVR, FDIVRP and FIDIV.
On top of these, SSE adds division instructions like DIVSS/DIVPS for single-precision SIMD ops. Similarly there are subtract operations, both float and integer. (Note that subtraction often uses the same addition circuitry but uses something called two's complement to do subtraction via addition. It is basically what we do: subtraction is addition with one operand negated.)
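A quick C sketch of that trick: negate one operand by flipping its bits and adding one (that is all two's complement is), then feed both to the same adder.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t a = 1000, b = 42;

    /* Two's complement: -b == ~b + 1, so a - b == a + ~b + 1.
       The hardware reuses the adder and just inverts b's bits
       while feeding in a carry of 1. */
    uint32_t diff = a + ~b + 1u;

    printf("%u %u\n", diff, a - b);   /* prints 958 958 */
    return 0;
}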
As for Gaussian elimination, matrix multiply operations are typically used in practice. Naive algorithms for matrix multiply have what is called (by pure mathematicians and theoretical computer scientists) an asymptotic complexity of O(n^3). The obvious lower bound is O(n^2), since you have to touch every entry at least once, and it is widely conjectured that the exponent can be pushed arbitrarily close to 2... Though nobody knows how to do it that quickly, algorithms faster than the standard n x n (square) matrix multiply are known:
e.g. Strassen's algorithm - f(n) ∈ O(n^2.807) (the 2x2 step is sketched below)
and the Coppersmith-Winograd algorithm - f(n) ∈ O(n^2.376)
The Coppersmith-Winograd algorithm is the fastest known matrix multiply algorithm. (The conjecture is that something arbitrarily close to O(n^2) is possible, but nobody has figured it out yet. If you can figure out what it is, you should inbox it to me. That way I can take the credit... er, I mean, so I can check if it's right.)
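As promised, here is the 2x2 step at the heart of Strassen's method, sketched in C with throwaway example values: seven multiplications instead of the usual eight (trading them for extra additions, which are cheap by comparison), and applying this recursively to blocks is what gives the ~n^2.807 bound.

#include <stdio.h>

int main(void)
{
    /* Two arbitrary 2x2 matrices A and B, just for demonstration */
    double a11 = 1, a12 = 2, a21 = 3, a22 = 4;
    double b11 = 5, b12 = 6, b21 = 7, b22 = 8;

    /* Strassen's seven products */
    double m1 = (a11 + a22) * (b11 + b22);
    double m2 = (a21 + a22) * b11;
    double m3 = a11 * (b12 - b22);
    double m4 = a22 * (b21 - b11);
    double m5 = (a11 + a12) * b22;
    double m6 = (a21 - a11) * (b11 + b12);
    double m7 = (a12 - a22) * (b21 + b22);

    /* Recombine into C = A*B using only additions and subtractions */
    double c11 = m1 + m4 - m5 + m7;
    double c12 = m3 + m5;
    double c21 = m2 + m4;
    double c22 = m1 - m2 + m3 + m6;

    printf("%g %g\n%g %g\n", c11, c12, c21, c22);   /* 19 22 / 43 50 */
    return 0;
}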
(While on this topic, here is a very interesting talk by Scott Aaronson at a TEDx at Caltech in honour of Feynman... He uses the Coppersmith-Winograd algorithm as an example in explaining complexity. A must see!)
http://www.youtube.com/watch?v=SczraSQE3MY
(Btw how do you make a link into an embedded youtube vid on this forum?)
However, I have no clue whether LINPACK uses any of these methods. Even in the worst case, using the standard algorithm, it can be implemented efficiently on the FPU using just the standard operations.
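And for completeness, the standard algorithm the FPU handles so comfortably is just this triple loop (a sketch with small hard-coded sizes): n^3 multiply-adds and nothing more exotic, which is exactly the kind of workload the FMA-heavy peak figures are built around.

#include <stdio.h>

#define N 3

int main(void)
{
    double a[N][N] = {{1,2,3},{4,5,6},{7,8,9}};
    double b[N][N] = {{9,8,7},{6,5,4},{3,2,1}};
    double c[N][N] = {{0}};

    /* Standard O(n^3) matrix multiply: every iteration of the inner
       loop is one multiply-add, so N*N*N = 27 of them in total here. */
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < N; ++k)
                c[i][j] += a[i][k] * b[k][j];

    for (int i = 0; i < N; ++i)
        printf("%g %g %g\n", c[i][0], c[i][1], c[i][2]);
    return 0;
}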