You aren't limited to one stack
However, the issue becomes bandwidth per stack: at 128GB/s per stack (closer to 256GB/s in a couple of years), you need 4 stacks for 512GB/s of bandwidth. Latency I'd expect to improve, along with simpler communication with the memory: shorter traces and lower power mean less overhead, and a simplified memory controller should give somewhat better access to the memory.
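As a quick sanity check on that stack arithmetic, here's a minimal sketch in Python; the per-stack figures are just the ones quoted in this post, not official specs:

```python
# Minimal sketch of the stack-count arithmetic; figures are the ones
# quoted in this post (128GB/s now, ~256GB/s in a couple of years).
import math

def stacks_needed(target_gbps: int, per_stack_gbps: int) -> int:
    """How many HBM stacks it takes to hit a target aggregate bandwidth."""
    return math.ceil(target_gbps / per_stack_gbps)

print(stacks_needed(512, 128))  # 4 stacks at today's 128GB/s per stack
print(stacks_needed(512, 256))  # 2 stacks once ~256GB/s stacks arrive
```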
Using it as cache would effectively increase power usage: you'd be adding an on-package chunk of memory while still requiring off-package memory, plus a more complex memory controller, which uses more power. It's the wrong direction, so it's very unlikely to happen.
Whether AMD does it is a complete unknown; these slides aren't new and there is a lot of guessing. Currently volume might be too low to make it viable cost-wise. We might well see it on FirePros or something for a generation, with something like £1.5k cards in Apple Macs subsidising early production until volumes/yields get good enough to be viable in desktop GPUs.
Where it gets a little interesting is that interposers are actually being made at 65nm, because there's 65nm capacity available all over the world and an interposer is effectively a circuit board, but at silicon trace widths rather than PCB + solder scale, so 65nm is still orders of magnitude finer, and crazy cheap. The HBM can be made on 20 or 14nm and packaged to be stuck on an interposer, and any chip can effectively be stuck on the interposer alongside it.
So we're at the point where it's possible to make a 65nm interposer, stick a 28nm GPU in the middle and 4 stacks of 20nm HBM around it.
It's likely (I'm not 100% clear on the tech) that you need a different memory controller for HBM, so considering the work already done on the existing 28nm chips, it's a fairly large step to make such big changes to current chips rather than make the change in the next gen.
The issue does come in that more stacks decrease yields. The biggest issue with interposers is this: make a set of chips like GPUs with, say, an 80% yield, and you throw the 20% away and the rest are working. Likewise you have working HBM stacks. Now stick them all together on an interposer and, while you can resolder a chip on a PCB relatively easily, you can't "re-do" an interposer: if a single chip you add fails, the whole thing is screwed. So you can have £100 of memory, £150 of GPU and £5 of interposer, all working, stick them together and end up with a non-working part. The more things you stick together, the more chances of failure, so more stacks is definitely an issue. That is where interposers start to hurt yields: the more things you stick together the worse the yields, and after 5-6 chips the yields tank horribly.
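To put rough numbers on that, here's a hedged sketch assuming (purely for illustration) that each chip-attach step succeeds 95% of the time and the dies themselves are already known-good, as described above:

```python
# Hedged sketch of the yield argument: the interposer can't be reworked,
# so every chip-attach step must succeed. The 95% per-attach success
# figure is made up for illustration; real numbers aren't public.
def assembly_yield(n_chips: int, attach_success: float = 0.95) -> float:
    """Probability that all n attach steps succeed on one interposer."""
    return attach_success ** n_chips

for n in (2, 5, 9):  # GPU + 1 stack, GPU + 4 stacks, GPU + 8 stacks
    print(f"{n} chips: {assembly_yield(n):.0%}")
# 2 chips: 90%, 5 chips: 77%, 9 chips: 63% -- more chips, worse yield
```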
So while interposers will let them stick things together to create silly-fast connections between chips, very close to as if it was all on the same die, the downside is that the more bits you stick together, the worse the yields get. The memory stacks themselves are only 4 dies high because the yields currently aren't there for 8-high stacks. But in 2 years you could have 4x 8-high stacks for 1TB/s of bandwidth with insanely better yields than you'd get from 8x 4-high stacks; the same bandwidth is possible either way, but the yields would drop enough to make prices non-viable.
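Running the same made-up 95% per-attach figure over those two configurations shows the trade-off (per-stack bandwidths are the ones quoted earlier, and the stacks' own die yields are ignored here):

```python
# Same hypothetical 95% per-attach success as above; per-stack bandwidth
# figures are the ones quoted earlier in this post (128GB/s now,
# ~256GB/s in a couple of years). Stack-height yields are ignored.
ATTACH_SUCCESS = 0.95

configs = {
    "4x 8-high stacks @ 256GB/s": (1 + 4, 4 * 256),  # GPU + 4 stacks
    "8x 4-high stacks @ 128GB/s": (1 + 8, 8 * 128),  # GPU + 8 stacks
}
for name, (chips, gbps) in configs.items():
    print(f"{name}: {gbps}GB/s, assembly yield ~{ATTACH_SUCCESS ** chips:.0%}")
# Both hit 1TB/s, but the 9-attach package yields noticeably worse.
```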