It's not a case of times changing: modules aren't two dies stuck together, and sticking two dies together these days is a significantly less "brute force" method than it used to be. Both AMD and Intel do it, 16 core AMD server chips are 2x 8 core dies. When Intel did it years ago, they did it with no dedicated interconnect, no bus and no real speed, so the communication and latency hit from one core talking to the other, or a thread moving across, was HUGE, because it really was just two dies stuck together with a basic connection. An HT link was ridiculously better than that first way of sticking dies together, and what Intel/AMD do now to connect dies makes the old school method look very antiquated. They are also full 8 cores no matter what people say, the shared scheduler is nothing more than a limitation of die size on 32nm. There are sacrifices to be made, but Intel have also taken up the module design with Atom in their most recent chip design, and WILL do so in the future on desktop.
It's very simple. Take a GPU as an example: you start off with 2 pipelines (equate them to shaders, though they aren't really the same thing), then 4, 8, 16, 32, 64, 128, 256 (well, 240), 512, etc, etc, etc.
Now equate that to any workplace. When you have 4 workers, one boss and nothing in between is fine; communication is quick and simple, 4 people can all fit in the boss's office to hear instructions and then get on with their work. When you reach 512 shaders or workers, imagine them all being instructed individually by one boss... it's simply inefficient: by the time you've told person 512 what he should be doing, people 1 through 500 have all finished and are sat waiting for something else to do.
So you have managers (clusters) and you subdivide the work, and at various stages you keep subdividing to keep communication balanced between being efficient and having too many layers of subdivision. A rough sketch of the difference is below.
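To put the boss/manager analogy into toy numbers (this is not any real GPU's dispatch logic, just a made-up model with arbitrary per-message costs), here is how the time to hand out instructions grows when one boss talks to everyone directly versus fanning the work out through layers of managers:

```python
def flat_dispatch_time(workers, msg_cost=1):
    """One boss instructs every worker one at a time:
    total time grows linearly with the number of workers."""
    return workers * msg_cost

def hierarchical_dispatch_time(workers, fanout=8, msg_cost=1):
    """Work is fanned out through managers, each briefing at most
    `fanout` subordinates in turn. Managers at the same level brief
    their own teams in parallel, so total time is roughly the tree
    depth times the cost of briefing one team."""
    depth, reach = 0, 1
    while reach < workers:
        reach *= fanout
        depth += 1
    return depth * fanout * msg_cost

for n in (4, 64, 512):
    print(n, "workers:",
          "flat =", flat_dispatch_time(n),
          "hierarchical =", hierarchical_dispatch_time(n))
```

Note that with only 4 workers the flat approach actually wins, which is exactly the "one boss, nothing in between" case above; the hierarchy only pays off as the worker count climbs.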
With AMD we had clusters, then the clusters themselves got split into two separate halves because there were too many of them, then we got the geometry engines doubled, and doubled again for the next generation.
Modules are simply that. It's utterly inefficient to wire up 8 separate cores individually at the power, bandwidth and transistor cost that entails, so at some stage you have to say: this data path gets doubled in width, but with 2 cores sharing each end of it. Transistors aren't just spent on the data itself; it's like a motorway. To build a road you need a pavement, power lines run alongside it, banks get built to absorb noise, lighting is laid down, phone cables, emergency kit, a hard shoulder. All of that is the same amount of work whether there are 4 lanes or 1, so it's more efficient to have fewer, fatter communication links in a CPU than many smaller ones.
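Putting made-up numbers on the motorway analogy (the fixed-overhead and per-lane figures here are invented for illustration, not real transistor counts):

```python
def link_cost(lanes, per_lane=100, fixed_overhead=400):
    """Cost of one communication link: the 'hard shoulder and lighting'
    (fixed_overhead) is paid once no matter how many lanes it carries."""
    return fixed_overhead + lanes * per_lane

# 8 cores, each with its own narrow link...
eight_narrow = 8 * link_cost(lanes=1)
# ...versus 4 links twice as wide, each shared by a pair of cores (a module).
four_wide = 4 * link_cost(lanes=2)

print("8 narrow links:", eight_narrow)    # 8 * (400 + 100) = 4000
print("4 wide shared links:", four_wide)  # 4 * (400 + 200) = 2400
```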
Modules WILL happen for Intel as well as AMD, and in the future it will happen more. Suddenly a module will be 2 cores, but there will be a cluster with 4 modules in it, and 4 clusters; that's how we'll get to more cores. Beyond that, we'll get 2 cores, in 4 modules, in 4 clusters, which are in two compute units. This is how chips have always worked and always will.
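Counting it out (the groupings here are just the hypothetical ones from the paragraph above, not any announced product):

```python
# Hypothetical hierarchy: cores per module, modules per cluster,
# clusters per compute unit, compute units per chip.
levels = {"cores_per_module": 2,
          "modules_per_cluster": 4,
          "clusters_per_unit": 4,
          "compute_units": 2}

total_cores = 1
for name, count in levels.items():
    total_cores *= count
    print(f"{name} = {count}, running total = {total_cores}")

# 2 * 4 * 4 * 2 = 64 cores, but any one core only ever talks directly
# within its module, its cluster and its compute unit -- never to all
# 63 others individually.
```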
This is what cores are, for one thing: CPUs started off with one integer pipe, then you had 3 integer pipes, then you had two cores for 2 sets of integer pipes rather than 6 int pipes in one core. It's all the same principle, and always has been.
With Bulldozer people ignored stuff that was said LONG before it was released. Firstly, Intel dominates the market and controls the most used compilers, which is a distinct disadvantage for any new chip from AMD for a significant period of time; likewise, the first chip of any new architecture is usually pretty terrible. Merom/Yonah were pretty terrible compared to the first desktop Core architecture, which is pretty crap compared to now. Bulldozer HAS improved over time with better optimisation of software, the OS and individual pieces.
The most crucial thing people are ignoring is simply that Bulldozer was designed to be an HSA compliant APU of the future. It integrates MANY ideas, like scaling back FPU hardware, because ARM, AMD and Intel are all moving towards GPU offloading of FP calculations, which already happens, is happening more, and is gaining the software stack across the industry to push and optimise for it.
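The offload idea, stripped right down (the gpu_saxpy argument below is a stand-in for whatever OpenCL/HSA kernel a real runtime would dispatch, not a real API, and the size threshold is arbitrary):

```python
def cpu_saxpy(a, x, y):
    """Plain FP loop that the CPU's own FPU would chew through."""
    return [a * xi + yi for xi, yi in zip(x, y)]

def saxpy(a, x, y, gpu_saxpy=None):
    """HSA-style idea: if an accelerator kernel is available and the job is
    big enough to be worth the trip, hand the FP-heavy work to it; otherwise
    keep it on the CPU. gpu_saxpy is a placeholder for a real GPU kernel."""
    if gpu_saxpy is not None and len(x) > 10_000:
        return gpu_saxpy(a, x, y)
    return cpu_saxpy(a, x, y)

print(saxpy(2.0, [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # small job stays on CPU
```

The point being that once the bulk FP work routinely takes that second path, a fat per-core FPU stops being where you want to spend your transistor budget.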
I said this at the time, before and after release, and a year before release: you can't plough what is likely in excess of a couple of billion into an architecture design and optimise it for the software available on the day of release; you optimise it for the software coming in the years afterwards.
Look up the HSA Foundation and the sheer breadth of industry support: ARM is well on board, AMD is on board, and the first HSA chips are being launched this year. HSA might be a very big reason AMD got both the PS4 and the (supposed) Xbox win, which is already being predicted to increase AMD's quarterly revenue by around 25%, which is huge for one project and one win.
Bulldozer was not in any way a chip designed for 2010, or for the software available in 2010. Anyone with half an ounce of sense can see that spending a couple of billion you can barely afford on optimising for the software of 2010 would make no sense when the entire industry is moving towards accelerating as much as possible, reducing power and interchangeable IP.