It does make me wonder how Sup Com did it back in the day. I mean, this looks great and will no doubt be far more efficient by comparison, but you could get similar numbers in a Sup Com game, which came out in 2007. 8,000 unit max, wasn't it?
It was one of the first games, if not the first, to detect multi-core CPUs and then allocate calculations to each core (up to 4).
I'm not entirely sure, but I think they used their own higher-level abstraction layer to achieve that.
Whatever they did, it's not conventional, and it was impressive for the time.
The problem today, as it was then, is that thread allocation in DX is pretty much fixed: you can allocate AI calculations to thread 1 and various real-time environment entities to threads 2, 3, 4, 5....
If one of those threads starts to pass about 4,000 draw calls, the handshake between the CPU thread and the GPU starts to get choked up, so it forms a queue, which increases latency. If the latency increases, the GPU needs to wait on the CPU, and if the GPU needs to wait on the CPU it has to slow down to let the CPU catch up... the end result is your frame rates drop.
Threading isn't so much the problem as the fact that it's fixed. If you have 6,000 batches on one thread, it's slowing the system down, and the system isn't intelligent enough to move some of those batches to a thread that is only loaded with 800 batches.
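To put rough numbers on that, here's a toy Python sketch of fixed allocation versus shifting batches around. It is only an illustration, not how DX or any driver actually works: the `rebalance` function is made up, the 1,500 and 1,200 loads are invented filler, the 4,000 budget is just the rough figure mentioned above, and it assumes the frame is held up by whichever thread is busiest.

```python
def rebalance(loads, budget):
    """Move batches off any thread above `budget` onto the lightest thread.

    `loads` is a list of batch counts per thread (hypothetical numbers).
    Returns a new list; the total number of batches is unchanged.
    """
    loads = list(loads)
    for i in range(len(loads)):
        while loads[i] > budget:
            # Pick the thread with the fewest batches right now.
            j = min(range(len(loads)), key=loads.__getitem__)
            room = budget - loads[j]
            if j == i or room <= 0:
                break  # nowhere left to put the extra work
            moved = min(loads[i] - budget, room)
            loads[i] -= moved
            loads[j] += moved
    return loads


# Fixed allocation: one thread stuck with 6,000 batches, another with 800.
fixed = [6000, 800, 1500, 1200]
balanced = rebalance(fixed, budget=4000)

# Assumption: the busiest thread is what the GPU ends up waiting on.
print(max(fixed), max(balanced))  # prints "6000 4000"
```

Same total work in both cases (9,500 batches), but in the balanced version no thread is over the 4,000 mark, which is the point: the load is the same, it's just not piled onto one thread.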
Then there's the abstraction layer, the system that's acting as the go-between. That system by itself is a massive cause of slowdowns.
Ideally it should be removed entirely, so the GPU communicates directly with the CPU, "low level".