Looks worse than I was led to believe - with the load balancing you've effectively got the functional equivalent of 2 of AMD's ACE units, but ticking over at a higher speed, versus what is likely to be 4 ACEs running slower on desktop Polaris parts.
The thing that will likely save nVidia here is that it is probably not possible to efficiently process that data 100% in parallel (loaded up synthetically, AMD's ACEs would trample on nVidia's implementation).
In truth, I don't think it was ever going to be a perfect solution with Pascal, but it looks to be more than capable to me, at least on paper.