I wonder how HBM will affect cooling. Now the heat has to travel through a load of RAM dies before it gets to the heatsink, surely?
Seems they would have to push down the power consumption a lot because of this. Maybe that's why it's a 1GHz clock rather than higher?
Clock speed is partly down to power, but it's more about efficiency.
There are lots of individual reasons. Going off-package means pins have a minimum size, and a wider bus with more connections to off-package memory means more pins and traces. There is literally a limit on how many non-silicon-scale connections you can make. When you move to the silicon scale, at 40nm or below, you can fit in ten times as many connections.
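To put rough numbers on the connection-density gap, here's a quick back-of-the-envelope comparison. The pitches are ballpark assumptions, not figures from any specific part:

```python
# Rough comparison of connection density: package balls vs interposer microbumps.
# Both pitch values are ballpark assumptions, not from any datasheet.
PACKAGE_BALL_PITCH_MM = 0.8    # typical BGA ball pitch
MICROBUMP_PITCH_MM = 0.055     # typical interposer microbump pitch (~55 um)

area_mm2 = 20 * 20  # assume a 20mm x 20mm region available for connections

pins_off_package = area_mm2 / PACKAGE_BALL_PITCH_MM ** 2
bumps_on_interposer = area_mm2 / MICROBUMP_PITCH_MM ** 2

print(f"off-package pins:      {pins_off_package:,.0f}")
print(f"interposer microbumps: {bumps_on_interposer:,.0f}")
print(f"density gain:          {bumps_on_interposer / pins_off_package:,.0f}x")
```

Per linear dimension that's a bit over 10x; per area it's far more. Either way the point is the same: once you're on an interposer, connection count stops being the expensive resource.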
HBM could easily be 4GHz and 256-bit, but it would take more power. Once you've moved onto an interposer, 256 connections or 1024 connections is pretty much no different; they can fit way, way more connections than that. A wider bus at a lower memory speed will always be more efficient. There will be higher-clock/higher-bandwidth versions in the future, but again it's basically down to what is required. If you can saturate the GPU with bandwidth using 4 stacks at 1GHz, then running them at 2GHz is just wasting power. It will move to 2GHz when GPUs need that much bandwidth.
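As a rough sketch of that tradeoff, treating the quoted clocks as per-pin data rates. The voltages are illustrative guesses, and P ∝ f·V² is only a first-order proxy for dynamic power:

```python
# Same bandwidth two ways: narrow-and-fast vs wide-and-slow.
# Per-pin data rates and voltages are illustrative assumptions.

def bandwidth_gb_s(bus_bits, gbps_per_pin):
    return bus_bits * gbps_per_pin / 8  # bits/s -> bytes/s

narrow = bandwidth_gb_s(256, 4.0)   # GDDR-style: 256-bit at 4 Gbps/pin
wide = bandwidth_gb_s(1024, 1.0)    # one HBM stack: 1024-bit at 1 Gbps/pin
print(narrow, wide, 4 * wide)       # 128.0 128.0 512.0 (GB/s)

# Dynamic power scales roughly with f * V^2, and higher clocks generally
# need higher voltage, so the wide/slow option wins on power:
def relative_power(freq_ghz, volts):
    return freq_ghz * volts ** 2

print(relative_power(4.0, 1.5) / relative_power(1.0, 1.2))  # ~6.25x
```

Same 128GB/s either way, but the narrow/fast version burns several times the I/O power, and four of the slow wide stacks already gets you 512GB/s.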
In terms of cooling, it's surprising how easy it is. Stacked chips work using through-silicon vias (TSVs): essentially copper connections running vertically through the chip, connecting the bottom layer to each layer above it. There can be thousands of them because each connection can basically be 20nm wide now. There was/is a real temperature difference between the bottom and top layers, but they add in dummy TSVs, giving in effect a copper cooling path from top to bottom. Adding 5-10% dummy TSVs takes the top-to-bottom temperature difference from around 25C to about 5C, so it's very effective.
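A crude way to see why a few percent of copper helps so much: vertical heat flow is dominated by the thin, poorly conducting bond layers between dies, and copper paths through those layers effectively short-circuit them thermally. All material values and geometry below are plausible assumptions chosen to show the ratio, not measured figures:

```python
# Toy model: temperature drop across the bond layers of a 4-die stack,
# treating copper TSV paths and bond material as parallel heat conductors.
# All numbers are illustrative assumptions.
K_COPPER = 400.0   # W/(m*K)
K_BOND = 0.5       # W/(m*K), underfill/dielectric bond layer

def delta_t(power_w, area_m2, layer_thickness_m, n_layers, cu_fraction):
    # effective vertical conductivity of one bond layer with copper paths
    k_eff = cu_fraction * K_COPPER + (1 - cu_fraction) * K_BOND
    per_layer = power_w * layer_thickness_m / (k_eff * area_m2)
    return per_layer * n_layers

P, A, T, N = 5.0, 40e-6, 20e-6, 4   # 5 W over ~40 mm^2, four 20 um bond layers

signal_only = delta_t(P, A, T, N, cu_fraction=0.02)   # ~2% copper, signal TSVs only
with_dummies = delta_t(P, A, T, N, cu_fraction=0.10)  # +8% dummy TSVs

print(f"{signal_only:.2f} K vs {with_dummies:.2f} K "
      f"(~{signal_only / with_dummies:.1f}x lower gradient)")
```

The absolute numbers depend entirely on what you assume, but the ratio (roughly 5x) is the interesting part, and it lines up with the 25C-to-5C figure above.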
Ultimately stacks aren't limited by clock speed, cooling, or power. It's mostly about tuning a GPU to the right number of stacks, memory, and bandwidth. There is also manufacturing scale: HBM needs to be produced in high volume to keep it affordable, so it needs to suit multiple uses. Again, lower power and better efficiency will lend HBM stacks to mobile devices, GPUs, laptops, and CPUs, and it will likely be used as what amounts to on-die cache in the future as well. 4GHz, higher-power, 1TB/s stacks would prove useful pretty much only in GPUs, so volume would drop drastically and price would increase greatly.
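The tuning itself is straightforward arithmetic; a minimal sketch, assuming 1024-bit stacks and a hypothetical 512GB/s target:

```python
import math

STACK_BUS_BITS = 1024  # bus width per HBM stack

def stacks_needed(target_gb_s, gbps_per_pin):
    per_stack = STACK_BUS_BITS * gbps_per_pin / 8
    return math.ceil(target_gb_s / per_stack)

# If ~512 GB/s is what saturates the GPU:
print(stacks_needed(512, 1.0))  # 4 stacks at 1 Gbps/pin
print(stacks_needed(512, 2.0))  # 2 stacks at 2 Gbps/pin, but half the capacity
```

Note the fast option also halves capacity, since each stack brings its own DRAM, which is another reason more slow stacks is usually the better fit.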
When you can, to an extent, just drop in another stack of efficient low-power HBM to provide more bandwidth and memory, high-power single stacks are possible but just not really required.