The problem is that the current generation of Bulldozer was supposed to have very similar instructions per clock (IPC) to K10.5. It was meant to either match or improve on it; that was one of the fundamental design parameters.
No, it wasn't. This was a SERVER chip first and foremost, designed for throughput, not IPC, and also designed as an APU, not a CPU on its own. The first gen is CPU only and the 2nd gen is an APU. We probably won't see APU-only lineups until 28/20nm, when the GPU power can be better utilised by Windows/Linux, which is still a little way off. Beyond Quick Sync/video encoding and some basic stuff there isn't a whole lot of GPU acceleration.
As said so long ago, this is a core that is two integer pipelines wide, vs three in the old gen. That shows a HUGE efficiency improvement per pipeline in each core: it has 33% fewer resources and is never that much slower, and is normally just about on par. Throw in 20% more performance from better scheduling and you mostly have 2 integer pipes outperforming 3.
It's clearly a flaw in the processor design. They could have designated the order in which cores are used. Currently it's 1,2,3,4,5,6,7,8 (filling the pairs 1-2, 3-4, 5-6, 7-8 in turn). It should have been 1,3,5,7,2,4,6,8.
The benefit of the current system is that unused modules are shut down, so power consumption is reduced. This in turn allows the turbo mode to clock the chip higher. Regardless of these points, the performance is reduced.
Whether the OS is patched or the CPU is fixed, the end result should be slightly higher multithreaded performance, at the cost of higher power consumption and lower CPU clocks.
It is and it isn't a design flaw. Firstly, a scheduler is incredibly fundamental to an OS, and patching it is NOT something done lightly. That is why you can already see Windows 8 builds with a new scheduler in them, and no sign of one for Windows 7. There are some things you often don't patch, as it can create so many problems.
Secondly, you do NOT want threads to go onto different modules every single time. Bulldozer's power gating is VERY effective; even overclocked, the clock and power gating is immense. Say you only had Windows background processes, which push through DOZENS of threads all the time but use almost no CPU power. With your method, where every new thread gets pushed to a new module, Bulldozer would never power gate any modules down and its idle power would be horrific, rather than matching that of a core with less than half the transistors.
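To put rough numbers on that trade-off, here is a minimal sketch of the two fill orders being argued about. It is a toy model, not how any real scheduler works: the 4-module, 2-cores-per-module layout with cores numbered 1 to 8 follows the numbering used above, and the function names are made up for illustration.

```python
# Toy model (assumption): 4 modules, 2 cores each, cores numbered 1..8,
# so cores 1-2 share module 0, cores 3-4 share module 1, and so on.
MODULES = 4
CORES_PER_MODULE = 2  # spread_order() below assumes exactly 2


def pack_order():
    """Fill both cores of a module before moving on: 1,2,3,4,5,6,7,8."""
    return list(range(1, MODULES * CORES_PER_MODULE + 1))


def spread_order():
    """First core of every module, then the second cores: 1,3,5,7,2,4,6,8."""
    cores = pack_order()
    return cores[0::2] + cores[1::2]


def module_of(core):
    """Index of the module that a given core number belongs to."""
    return (core - 1) // CORES_PER_MODULE


def active_modules(order, n_threads):
    """How many modules must stay powered to run n_threads busy threads."""
    return len({module_of(core) for core in order[:n_threads]})


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, "threads:",
              "pack keeps", active_modules(pack_order(), n), "modules awake,",
              "spread keeps", active_modules(spread_order(), n))
```

With 2 threads the packed order keeps one module awake and the spread order keeps two; with 4 threads it is two versus all four. That is the power-gating cost of always spreading, before you even consider L2 sharing.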
Thirdly, there are situations in which sharing L2 and the same data can improve performance when two threads are within one module. On top of that, if you start off with 2 threads and put them in different modules, but then get another 3-4 threads from other programs so there are 2 threads in each module, you would be even more likely to see the sharing of L2 do better for two threads from the same program.
In other words, a scheduler is NOT basic, and it's in no way cut and dried. It can't simply use a new module for every new thread up to 4; that would make the chip FAR worse than it is now. It's not even slightly feasible as an idea, and this is why a patch for Win 7 isn't certain, maybe not even that likely (and may not be great if it is done).
There are hundreds of different types of data processing and thousands of scenarios. Getting a chip working best in all of them is very difficult, and the chips before Bulldozer were far less complex.
On top of the situations I highlighted (power saving, thread combining, the fact that usage can be constantly swapping between 2 and 8 threads, and that constantly moving a thread from one module to another won't help), consider this: what if you've got 2 heavy integer threads, then another program starts with 6 more threads, of which 4 are incredibly heavy on FPU? You'd probably be best off with one FPU-heavy thread in each module and the rest spread out as best as possible. There are, as I said, THOUSANDS of possibilities. The scheduler for i7 has basically been worked on and improved since Yonah IIRC, or Merom, 1-2 gens before Conroe. The Athlon 64 scheduler has been worked on and tweaked for years.
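Just to illustrate that one scenario, here is a rough sketch of an "FPU-heavy threads first" placement. Everything in it is an assumption for the sake of the example: a real scheduler doesn't get handed a neat fpu_heavy flag per thread, the thread names are invented, and this ignores power gating, L2 sharing and threads coming and going. It relies only on each module having one FPU shared between its two cores.

```python
def place(threads, modules=4, cores_per_module=2):
    """Assign threads to modules, FPU-heavy ones first.

    threads: list of (name, fpu_heavy) tuples (hypothetical input format).
    Returns one list of thread names per module.
    """
    slots = [[] for _ in range(modules)]
    fpu_heavy = [t for t in threads if t[1]]
    others = [t for t in threads if not t[1]]

    # Placing the FPU-heavy threads first means each lands on an empty
    # module while one is available, so they don't share an FPU.
    for name, _ in fpu_heavy + others:
        free = [m for m in range(modules) if len(slots[m]) < cores_per_module]
        if not free:
            raise ValueError("more threads than cores")
        # Least-loaded module wins; ties go to the lowest module index.
        target = min(free, key=lambda m: len(slots[m]))
        slots[target].append(name)
    return slots


if __name__ == "__main__":
    # The case from the post: 2 heavy integer threads, then 6 more threads,
    # 4 of which are FPU heavy.
    workload = [("int0", False), ("int1", False),
                ("fpu0", True), ("fpu1", True), ("fpu2", True), ("fpu3", True),
                ("light0", False), ("light1", False)]
    for index, names in enumerate(place(workload)):
        print("module", index, names)
```

For that workload each module ends up with exactly one FPU-heavy thread plus one of the others. The catch is that this is only "best" for this one mix; change the mix and the right answer changes with it, which is the whole point.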
It will take time; that would be true whether Bulldozer came out yesterday or in 3 years.