I do think GCN 4.0 is a big improvement, as they have changed a lot in the GPU, and hopefully a lot of the bottlenecks from this gen will be gone.
The inefficient compute unit utilisation is mostly due to the API (DX11), not necessarily the architecture; DX12 gets rid of that with async compute. Polaris will still have ACEs instead of a GigaThread engine like Maxwell, but it will have far fewer compute units (I think something like 2.3k SPs instead of the 4k SPs in Fury), so yeah, it will be more efficient. By the time Vega comes out, hopefully devs will be using async compute more often, especially if Pascal adds it too.
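To make the utilisation point a bit more concrete, here is a rough toy model in Python. Everything in it is an assumption for illustration: the function names, the phase durations, and the workload numbers are made up, not measurements from any real GPU. It only shows why letting compute work fill CUs that graphics leaves idle can shorten a frame; it ignores scheduling overhead, bandwidth contention, and everything else a real driver has to deal with.

```python
# Toy model: a frame is a list of graphics phases, each given as
# (duration_ms, busy_cus). Compute work is measured in CU-milliseconds.
# All numbers are hypothetical and chosen only for illustration.

def frame_time_serial(graphics_phases, compute_work, total_cus):
    """Graphics phases run back to back, then compute runs alone on all CUs."""
    graphics_time = sum(duration for duration, _ in graphics_phases)
    return graphics_time + compute_work / total_cus

def frame_time_async(graphics_phases, compute_work, total_cus):
    """Compute work fills whatever CUs each graphics phase leaves idle."""
    remaining = compute_work
    graphics_time = 0.0
    for duration, busy_cus in graphics_phases:
        idle_cus = total_cus - busy_cus
        remaining = max(0.0, remaining - idle_cus * duration)
        graphics_time += duration
    # Any compute work that did not fit into idle CUs runs afterwards.
    return graphics_time + remaining / total_cus

# Hypothetical 64-CU part: one phase that fills the GPU, two that do not.
phases = [(4.0, 64), (3.0, 32), (3.0, 16)]
compute = 192.0  # CU-milliseconds of compute work

print(frame_time_serial(phases, compute, 64))
print(frame_time_async(phases, compute, 64))
```

With these made-up numbers the compute work fits entirely into the CUs left idle by the second and third graphics phases, so the async frame is no longer than the graphics work alone, while the serial frame pays for the compute work on top.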