It may not be a software limitation. Increasing the number of cores doesn't give you linear scaling; other bottlenecks rear their heads and interfere. Otherwise Nvidia would just reuse the same Fermi architecture and keep adding cores. Fiji has the same 4 shader engines as Hawaii, only now they deliver work to more cores. If the limit is in the shader engines, then the extra cores are effectively sitting idle. There was also no change in the number of ROPs, which are tied to the shader engines, and that could also limit throughput.
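To put the shader engine argument in rough numbers, here is a minimal sketch of a toy model where a fixed front end feeds a growing pool of cores. The core counts are the published stream processor counts for Hawaii (2816) and Fiji (4096); the function names, `per_core_rate` and `front_end_rate` are made up for illustration and are not measured values for any real GPU.

```python
def effective_throughput(cores, per_core_rate, front_end_rate):
    """Work done per unit time when a fixed front end feeds the cores.

    The cores can never finish more work than the front end hands out,
    so the result is the smaller of the two rates.
    """
    return min(cores * per_core_rate, front_end_rate)


def scaling(cores_before, cores_after, per_core_rate, front_end_rate):
    """Speedup from adding cores while the front end stays the same."""
    before = effective_throughput(cores_before, per_core_rate, front_end_rate)
    after = effective_throughput(cores_after, per_core_rate, front_end_rate)
    return after / before


# Hypothetical units: each core does 1.0 unit of work per tick,
# and the 4 shader engines together can dispatch 3500 units per tick.
ideal = 4096 / 2816
limited = scaling(2816, 4096, per_core_rate=1.0, front_end_rate=3500.0)
print(f"ideal scaling: {ideal:.2f}x, front-end limited: {limited:.2f}x")
```

In this toy setup the extra cores only help until the front end saturates, after which they sit idle. A real GPU has many more interacting limits (ROPs, caches, clocks), so treat this as an illustration of the idea and not a performance prediction.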
Then there is a big debate on the effectiveness of HBM when you are not bandwidth limited, which most of the time, at most resolutions, you aren't. People rarely bother overclocking the GPU memory because it doesn't increase performance much, since bandwidth is not currently a problem. HBM may be solving a problem that doesn't currently exist, but AMD had to go that route in order to cut power requirements. HBM may also carry some other cost in cases where the extra bandwidth is no advantage. I have seen some strange benchmarks where overclocking the Fiji memory resulted in big performance gains, yet the gains should be much smaller than with slower GDDR5 because Fiji should be less bandwidth limited.
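The reason a memory overclock "should" do little on a card that isn't bandwidth limited can be sketched with a simple model where frame time is whichever is slower, the compute work or the memory traffic. All names and numbers below are hypothetical and chosen only to show the two regimes; they are not Fiji or GDDR5 measurements.

```python
def frame_time(compute_time, bytes_moved, bandwidth):
    """Frame time is set by the slower of compute and memory traffic."""
    return max(compute_time, bytes_moved / bandwidth)


def speedup_from_memory_overclock(compute_time, bytes_moved, bandwidth, overclock):
    """Speedup from raising memory bandwidth by `overclock` (0.10 = +10%)."""
    base = frame_time(compute_time, bytes_moved, bandwidth)
    faster = frame_time(compute_time, bytes_moved, bandwidth * (1 + overclock))
    return base / faster


# Compute-bound case: memory traffic finishes long before the shaders do,
# so extra bandwidth changes nothing.
compute_bound = speedup_from_memory_overclock(10.0, 512.0, 512.0, 0.10)

# Bandwidth-bound case: memory traffic dominates, so a 10% memory
# overclock shows up as roughly a 10% gain.
bandwidth_bound = speedup_from_memory_overclock(1.0, 5120.0, 512.0, 0.10)

print(f"compute bound: {compute_bound:.2f}x, bandwidth bound: {bandwidth_bound:.2f}x")
```

Under this model, big gains from a memory overclock only appear in the bandwidth-bound case, which is why such gains on Fiji look strange. The model ignores latency and any clock domains tied to the memory clock, either of which could be what those benchmarks are really showing.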