That's pretty interesting. So latency between cores on different CCXs is a bit over 3x as high as between cores on the same CCX. Some rough numbers from their graphs:
Intel SMT latency (same physical core, different logical core): 15 ns
AMD SMT latency (same physical core, different logical core): 25 ns
Intel core latency (different physical core): 80 ns
AMD core latency (different physical core, same CCX): 45 ns
AMD core latency (different physical core, different CCX): 145 ns
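To put the rough numbers above side by side, here is a quick back-of-the-envelope calculation. The values are just the approximate readings listed above, not exact measurements, and the dictionary keys are names I made up for illustration.

```python
# Approximate latencies in nanoseconds, as read off the graphs (rough values).
latency_ns = {
    "intel_smt": 15,         # same physical core, different logical core
    "amd_smt": 25,           # same physical core, different logical core
    "intel_cross_core": 80,  # different physical core
    "amd_same_ccx": 45,      # different physical core, same CCX
    "amd_cross_ccx": 145,    # different physical core, different CCX
}

def ratio(a, b):
    """Return how many times larger latency `a` is than latency `b`."""
    return latency_ns[a] / latency_ns[b]

# Penalty for crossing a CCX boundary relative to staying inside one CCX.
cross_vs_same_ccx = ratio("amd_cross_ccx", "amd_same_ccx")

# AMD cross-CCX latency relative to Intel core-to-core latency.
cross_ccx_vs_intel = ratio("amd_cross_ccx", "intel_cross_core")

# AMD same-CCX latency relative to Intel core-to-core latency.
same_ccx_vs_intel = ratio("amd_same_ccx", "intel_cross_core")

print(f"cross-CCX vs same-CCX:   {cross_vs_same_ccx:.2f}x")
print(f"cross-CCX vs Intel core: {cross_ccx_vs_intel:.2f}x")
print(f"same-CCX vs Intel core:  {same_ccx_vs_intel:.2f}x")
```

So within a CCX the AMD number is lower than Intel's core-to-core number, and it's only the trip across CCXs that costs more.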
It also shows that Windows does understand AMD's SMT implementation, at least in terms of the core layout. Patching Windows so that it also understands the CCX layout would very likely give the biggest improvement.
I believe the L3 cache is local to each CCX. So if a thread gets moved over to a different CCX, it starts with a cold cache and has to pull its data in again.
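Until the OS handles this itself, one workaround is to restrict a program to the logical processors of a single CCX so it never migrates across. Below is a minimal sketch of building the affinity mask for that. It assumes 4 cores per CCX, 2 SMT threads per core, and logical processors numbered consecutively core by core. Those are assumptions about the layout, so check the actual topology on your machine. Both function names are hypothetical helpers, not part of any library.

```python
def ccx_logical_cpus(ccx_index, cores_per_ccx=4, threads_per_core=2):
    """List the logical CPU numbers belonging to one CCX.

    Assumes logical CPUs are numbered consecutively, core by core,
    with all of CCX 0 first, then CCX 1, and so on (an assumption).
    """
    per_ccx = cores_per_ccx * threads_per_core
    start = ccx_index * per_ccx
    return list(range(start, start + per_ccx))

def ccx_affinity_mask(ccx_index, cores_per_ccx=4, threads_per_core=2):
    """Build a bitmask with one bit set per logical CPU in the given CCX."""
    mask = 0
    for cpu in ccx_logical_cpus(ccx_index, cores_per_ccx, threads_per_core):
        mask |= 1 << cpu
    return mask

# Print the masks in hex for the first two CCXs under the assumed layout.
for ccx in (0, 1):
    print(f"CCX {ccx}: cpus={ccx_logical_cpus(ccx)} mask={ccx_affinity_mask(ccx):X}")
```

On Windows the hex mask can be passed to `start /affinity`, and on Linux the CPU list can be passed to `os.sched_setaffinity`. Either way this only keeps the program on one CCX; it doesn't make the scheduler any smarter.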