The bus width on HBM is stacked, for example...
Standard bus:
256-bit @ 6000MHz = 192GB/s (2GB, 4GB, 8GB layout)
384-bit @ 6000MHz = 288GB/s (3GB, 6GB, 12GB layout)
512-bit @ 6000MHz = 384GB/s (4GB, 8GB, 16GB layout)
HBM 2+1 stack:
256-bit x2 (512-bit effective) @ 6000MHz = 384GB/s (2GB, 4GB, 8GB layout)
384-bit x2 (768-bit effective) @ 6000MHz = 576GB/s (3GB, 6GB, 12GB layout)
512-bit x2 (1024-bit effective) @ 6000MHz = 768GB/s (4GB, 8GB, 16GB layout)
HBM 4+1 stack:
256-bit x4 (1024-bit effective) @ 6000MHz = 768GB/s (2GB, 4GB, 8GB layout)
384-bit x4 (1536-bit effective) @ 6000MHz = 1152GB/s (3GB, 6GB, 12GB layout)
512-bit x4 (2048-bit effective) @ 6000MHz = 1536GB/s (4GB, 8GB, 16GB layout)
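(For reference, here's a quick Python sketch of where those quoted numbers come from: it's just conventional GDDR5-style arithmetic, bus width times effective data rate, which is exactly the assumption that doesn't carry over to HBM.)

# Sketch of the arithmetic behind the quoted table: treat the memory like
# conventional GDDR5, where bandwidth = bus width x effective data rate.
def gddr_style_bandwidth(bus_width_bits, effective_rate_mtps):
    """Bandwidth in GB/s: bytes per transfer times transfers per second."""
    return bus_width_bits / 8 * effective_rate_mtps / 1000

for width in (256, 384, 512):
    print(f"{width}-bit @ 6000MT/s = {gddr_style_bandwidth(width, 6000):.0f}GB/s")
# 256-bit -> 192GB/s, 384-bit -> 288GB/s, 512-bit -> 384GB/s, matching the quoted table.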
This isn't even slightly how HBM works, fundamentally, either on speeds or on total bandwidth.
A single 4-Hi stack of HBM provides 128GB/s of bandwidth, no more, no less. AFAIK the first production run will be 1GB per stack, moving to higher-density 2GB stacks (if required) within about 6-12 months, though the way production lines up, the denser stacks might be available by the time the first cards are.
Each stack also has a 128-bit bus, so you need a 512-bit memory controller on the GPU to access four stacks of 4-Hi HBM. Four stacks would give you 4x 128GB/s of bandwidth and 4GB (8GB once they start using the higher-density chips).
768GB/s would require six stacks and a 768-bit memory controller, giving you 6GB or 12GB; 1024GB/s would require eight stacks and a 1024-bit memory controller, giving you 8GB or 16GB.
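To put that scaling in one place, here's a rough Python sketch using the per-stack figures from this post: 128GB/s and 128 bits of controller width per stack, with 1GB or 2GB of capacity per stack. These are the post's working assumptions, not a spec sheet.

# Per-stack figures as stated above; bandwidth, bus width and capacity all
# scale linearly with the number of stacks on the interposer.
PER_STACK_BW_GBPS = 128         # bandwidth per 4-Hi stack
PER_STACK_BUS_BITS = 128        # controller width per stack
PER_STACK_CAPACITY_GB = (1, 2)  # first-run density, later density

for stacks in (4, 6, 8):
    bus_bits = stacks * PER_STACK_BUS_BITS
    bandwidth = stacks * PER_STACK_BW_GBPS
    low_cap, high_cap = (stacks * c for c in PER_STACK_CAPACITY_GB)
    print(f"{stacks} stacks: {bus_bits}-bit controller, {bandwidth}GB/s, {low_cap} or {high_cap}GB")
# 4 stacks -> 512-bit, 512GB/s; 6 stacks -> 768-bit, 768GB/s; 8 stacks -> 1024-bit, 1024GB/s.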
This is the part I've read almost no information on: the memory controller on die. The biggest reason for the large power saving is signal distance. On a PCB the traces can end up very long, say 10-50cm, by the time they've travelled up and down through up to 12 layers; send that same signal over 1-2cm of on-package copper traces a fraction of the width and the power drops significantly. But I don't know how this affects the memory controller. I presume it simplifies the controller massively: it doesn't have to generate powerful signals, so the I/O power circuitry will be much smaller. I suspect a conventional GDDR5 memory controller might be, say, 50mm^2 for 512-bit, while a 512-bit HBM controller might be 40mm^2, or 20mm^2... I really don't know. I've basically not seen anyone mention how it will change on that side.
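As a very rough first-order sketch of the trace-length argument: the switching power to drive a signal line scales roughly with C*V^2*f per pin, and the line capacitance C grows with trace length. Every number below is an invented placeholder, not a measured GDDR5 or HBM figure, so only the shape of the comparison matters.

# First-order per-pin switching power model: charge/discharge the trace
# capacitance V^2 * f times per second. Longer trace -> more capacitance.
def line_switching_power(length_cm, cap_pf_per_cm, voltage, toggle_mhz):
    """Relative per-pin switching power for a trace of the given length."""
    cap_farads = length_cm * cap_pf_per_cm * 1e-12
    return cap_farads * voltage ** 2 * toggle_mhz * 1e6

pcb_route = line_switching_power(30, 1.0, 1.5, 1750)       # assumed GDDR5-style PCB route
interposer_route = line_switching_power(2, 1.0, 1.2, 500)  # assumed HBM-style interposer route
print(f"the long PCB route costs roughly {pcb_route / interposer_route:.0f}x more per pin")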
Ultimately I don't know if it's viable to go beyond a 512-bit memory controller yet. The reason we haven't done so without HBM is power/cost: a 512-bit memory controller eats power and space on die. At some point it becomes cheaper and easier to add more transistors for on-die texture compression or colour compression. We've seen this for years: memory controller size and power cost climb until a new compression method scales back the bandwidth required, or a new memory technology comes along and scales the power back.
I would be surprised if they jumped beyond 512-bit/4 stacks of HBM for the next generation. They might find they need to go to 768-bit and 6 stacks to reach the 6GB mark, or, with the delay getting to 20nm, the 2GB stacks might be available in time and the cards might work fine with 6GB from 3 stacks. It's worth noting that every additional chip you stick together on an interposer pushes effective yields down and cost up, so ultimately they'll keep it as tight as possible on bandwidth and cost.
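A toy illustration of that yield point, assuming each extra die on the interposer has some independent chance of being lost during assembly. The 98% per-die figure is made up purely to show the trend.

# Yield compounding on a multi-die interposer assembly: one GPU die plus one
# die per HBM stack, each with an assumed independent survival rate.
PER_DIE_ASSEMBLY_YIELD = 0.98   # invented per-die survival rate

for hbm_stacks in (2, 3, 4, 6, 8):
    dies_on_interposer = 1 + hbm_stacks   # GPU die plus each HBM stack
    assembled_yield = PER_DIE_ASSEMBLY_YIELD ** dies_on_interposer
    print(f"{hbm_stacks} stacks: ~{assembled_yield:.1%} of assemblies survive")
# More stacks -> lower effective yield -> higher cost per good card.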
There are lots of questions that remain about HBM. Because the memory is stacked, is there a maximum thermal limit on the top chip to make sure the bottom chip doesn't get too hot? For example, 90C might be fine for the GPU, but if the top memory chip is at 90C the bottom memory chip might be at 120C; they might need the top memory chip to stay at 70C max so the bottom chip doesn't go over 90C. Will HBM overclock at all, or is it locked forever at 1GHz? How could that affect overclocking, and will AMD have to include bandwidth overhead to cover it?
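Purely to illustrate that stacking concern, here's a toy model where each layer below the cooled top die runs a fixed temperature step hotter. The 7C-per-layer step is an invented number, just to show why a cap on the reported (top) temperature may really be about protecting the bottom die.

# Toy thermal gradient through a 4-Hi stack: the die furthest from the
# heatsink runs hotter than whatever the top-die sensor reports.
DELTA_PER_LAYER_C = 7   # invented per-layer temperature step
LAYERS = 4              # 4-Hi stack

def bottom_die_temp(top_die_temp_c):
    return top_die_temp_c + DELTA_PER_LAYER_C * (LAYERS - 1)

for top_temp in (70, 80, 90):
    print(f"top die at {top_temp}C -> bottom die roughly at {bottom_die_temp(top_temp)}C")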
HBM can certainly bring benefits, but there could be problems too, thermal or overclocking related, or there might turn out to be no downsides at all.
Either way, the bandwidth doesn't scale as simply as you think, and due to cost AMD/Nvidia won't be throwing 8 stacks on an interposer to get to 1024GB/s any time soon.