It requires either very high bandwidth OR true parallel async compute.
I haven't programmed this sort of thing yet, but from what I've read, the following seems like a reasonable explanation to me:
AMD cards use the copy queue to stream the textures while the compute/3D queues carry on with actual graphics work in parallel. NVidia cards can't copy in parallel and have to fall back on their preemption mechanism to implement the async behavior (i.e. pause the transfer to render frames). It follows that this effect will show up whenever the memory bandwidth isn't high enough to complete the data transfer within the time slices NVidia allocates to copying.
In short, memory bandwidth becomes more important for NVidia cards, whereas AMD cards can get away with lower bandwidth as long as they start transferring the textures in time.
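For reference, this is roughly how that copy queue shows up to a Vulkan application in the first place: a queue family that advertises transfer but neither graphics nor compute, which on AMD maps to the dedicated DMA engines. A minimal sketch (the function name and the fallback behaviour are mine):

```cpp
// Minimal sketch: find a dedicated transfer-only queue family so texture
// uploads can run on the copy/DMA engine alongside graphics work.
#include <vulkan/vulkan.h>
#include <cstdint>
#include <vector>

// Returns the index of a queue family that supports transfer but neither
// graphics nor compute, or UINT32_MAX if no such family exists.
uint32_t FindDedicatedTransferQueueFamily(VkPhysicalDevice gpu)
{
    uint32_t count = 0;
    vkGetPhysicalDeviceQueueFamilyProperties(gpu, &count, nullptr);
    std::vector<VkQueueFamilyProperties> families(count);
    vkGetPhysicalDeviceQueueFamilyProperties(gpu, &count, families.data());

    for (uint32_t i = 0; i < count; ++i) {
        const VkQueueFlags flags = families[i].queueFlags;
        if ((flags & VK_QUEUE_TRANSFER_BIT) &&
            !(flags & VK_QUEUE_GRAPHICS_BIT) &&
            !(flags & VK_QUEUE_COMPUTE_BIT))
            return i;           // dedicated copy engine (e.g. AMD's DMA queues)
    }
    return UINT32_MAX;          // fall back to uploading on the graphics queue
}
```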
I assume NVidia can fix this on the 1070 by changing the driver to prioritize copying a bit (at the cost of somewhat lower FPS during that time). Or the developers can tune this manually to get the 1070 to work perfectly, but it's just too time-consuming to do that for every single card. For example, the 1080 with its faster GDDR5X seems to be fast enough and is totally unaffected; tuning a path for the 1070 would make the 1080 take an FPS hit for no reason. So as a dev you need two different paths to get both working optimally? And what about the other models?
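As an aside on the "prioritize copying" idea: the only knob Vulkan gives the application for this is the per-queue priority set at device creation, and the spec treats it purely as a hint, so whether the driver actually weights the copy engine's time slices by it is entirely up to NVidia. A sketch of how you would even express the request (the helper name and the priority values are mine):

```cpp
// Sketch: request queues with explicit priority hints at device creation.
// pQueuePriorities takes normalized floats in 0.0..1.0; the driver may ignore them.
#include <vulkan/vulkan.h>
#include <cstdint>

VkDeviceQueueCreateInfo MakeQueueRequest(uint32_t familyIndex, const float* priority)
{
    VkDeviceQueueCreateInfo info{VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO};
    info.queueFamilyIndex = familyIndex;
    info.queueCount       = 1;
    info.pQueuePriorities = priority;   // hint only; no guaranteed effect
    return info;
}

// Usage (hypothetical values): ask for the transfer queue at a higher
// priority than the graphics queue, hoping uploads get bigger time slices.
// static const float kNormal = 0.5f, kHigh = 1.0f;
// VkDeviceQueueCreateInfo queues[] = {
//     MakeQueueRequest(graphicsFamily, &kNormal),
//     MakeQueueRequest(transferFamily, &kHigh),
// };
```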
I don't think people realise the importance of this. Sure, 1-2 seconds of worse image quality on the 1070 is not a big deal (seriously, who cares?). However, this is the kind of thing that makes tuning for AMD so much easier.
Multi-engine is hard enough already, even without having to worry about all this. On AMD cards you can just schedule whatever you need and let the ACEs figure out the scheduling, without worrying so much about every single detail. NVidia instead needs devs to put in the time and tune these operations manually, or has to do it itself via game profiles in its drivers (not sure how well the latter will work, but that's what NVidia keeps saying: we will address this stuff in software).
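To make "just schedule whatever you need" concrete from the app side: you submit the upload on the transfer queue, signal a semaphore, and have the submission that actually samples the new texture wait on it; how the engines interleave after that is the hardware's and driver's problem. A rough sketch, with all handles assumed to already exist and error handling omitted:

```cpp
// Sketch: hand the texture copy to the transfer queue and let the graphics
// submission that needs the result wait on a semaphore.
#include <vulkan/vulkan.h>

void SubmitUploadAndFrame(VkQueue transferQueue, VkQueue graphicsQueue,
                          VkCommandBuffer uploadCmd, VkCommandBuffer frameCmd,
                          VkSemaphore uploadDone)
{
    // 1) Kick off the texture copy on the transfer queue; signal when done.
    VkSubmitInfo copySubmit{VK_STRUCTURE_TYPE_SUBMIT_INFO};
    copySubmit.commandBufferCount   = 1;
    copySubmit.pCommandBuffers      = &uploadCmd;
    copySubmit.signalSemaphoreCount = 1;
    copySubmit.pSignalSemaphores    = &uploadDone;
    vkQueueSubmit(transferQueue, 1, &copySubmit, VK_NULL_HANDLE);

    // 2) Submit the frame that samples the new texture; only its
    //    fragment-shader stage waits on the copy, earlier stages run freely.
    VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;
    VkSubmitInfo drawSubmit{VK_STRUCTURE_TYPE_SUBMIT_INFO};
    drawSubmit.waitSemaphoreCount = 1;
    drawSubmit.pWaitSemaphores    = &uploadDone;
    drawSubmit.pWaitDstStageMask  = &waitStage;
    drawSubmit.commandBufferCount = 1;
    drawSubmit.pCommandBuffers    = &frameCmd;
    vkQueueSubmit(graphicsQueue, 1, &drawSubmit, VK_NULL_HANDLE);
}
```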
I find it interesting that the basic premise for devs going forward is now: 'worry about getting NVidia to work acceptably, as AMD cards will sort themselves out'. For example,
this DX12 dev guide (yes, I know Doom is Vulkan, but the multi-engine paradigm is the same between the two) actually goes as far as to spell it out:
* Choose sufficiently large batches of short running shaders.
Long running shaders can complicate scheduling on Nvidias hardware. Ensure that the GPU can remain fully utilized until the end of each batch. Tune this for Nvidias hardware, AMD will adapt just fine.
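One way to read that advice (my interpretation, not something the guide spells out in code): keep each dispatch short, but record enough of them back-to-back that the batch as a whole keeps the GPU busy. A sketch, with the pipeline, its layout and the shader's push-constant convention all assumed:

```cpp
// Sketch: record many short compute dispatches into one command buffer
// instead of one huge, long-running dispatch, so the scheduler gets natural
// boundaries where it can slot in other work.
#include <vulkan/vulkan.h>
#include <cstdint>

void RecordComputeBatch(VkCommandBuffer cmd, VkPipeline pipeline,
                        VkPipelineLayout layout,
                        uint32_t totalGroups, uint32_t groupsPerDispatch)
{
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline);

    for (uint32_t first = 0; first < totalGroups; first += groupsPerDispatch) {
        const uint32_t count = (totalGroups - first < groupsPerDispatch)
                                   ? (totalGroups - first)
                                   : groupsPerDispatch;
        // Assumed convention: the shader offsets its workgroup ID by `first`,
        // passed as a push constant. Barriers between passes are omitted.
        vkCmdPushConstants(cmd, layout, VK_SHADER_STAGE_COMPUTE_BIT,
                           0, sizeof(first), &first);
        vkCmdDispatch(cmd, count, 1, 1);
    }
}
```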
NVidia is entering an age where it will have minor quirks like this all over the place and will rely on its mindshare for users to 'tolerate' them while it sorts things out in drivers. At the same time, developers will have less and less incentive to profile every single NVidia card model to get it to work acceptably. They will choose a baseline (and I hope it's the 1060, so that things work for most users) and tune for that minimum: lower models will have degraded quality, while higher models may take a bit of an FPS hit.
I really want to see how the 1060 with its 192-bit bus copes with these scenes in Vulkan mode. It may make the problem much more visible, which would confirm my suspicion (i.e. they optimised Doom for the 1080, which makes the 1070 suffer and the 1060 suffer even more).