https://www.zdnet.com/article/cerebras-ceo-big-implications-for-deep-learning-in-companys-big-chip/
Wow, a single WSE can handle a 4-billion-parameter model, roughly 4,000 times more than an Nvidia Volta V100 GPU, Intel's upcoming Xe GPU for the datacentre, or AMD's Radeon Instinct Vega and Navi GPUs; even the fastest supercomputers with thousands of GPUs are still limited to around 1-million-parameter models. Two WSE machines in a cluster can handle 8 billion parameters, and four WSE machines together can handle 16 billion. Bloody hell, AMD's upcoming Frontier supercomputer, due in 2021, with 100 cabinets and thousands of Radeon Instinct GPUs on a network, is limited to a 1-million-parameter model and will draw over 30 MW of power. I can't imagine it: 100 cabinets of water-cooled WSEs could handle a 400-billion-parameter model to solve the biggest problems ever, using just 1.5 MW of power.
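The scaling in that comment is straightforwardly linear, so here is a quick sanity check of the arithmetic. The per-WSE figures are taken from the comment itself (4 billion parameters per machine) plus an assumed per-system draw of 15 kW, chosen only because it makes 100 systems come out to the comment's 1.5 MW; neither number is a verified spec.

```python
# Back-of-the-envelope scaling implied by the comment above.
# Assumed figures (from the comment, not verified hardware specs):
#   - one WSE handles a ~4-billion-parameter model
#   - one WSE system draws ~15 kW, so 100 systems come to 1.5 MW

PARAMS_PER_WSE = 4_000_000_000   # claimed capacity of a single WSE
KW_PER_WSE = 15                  # assumed per-system power draw

def cluster_capacity(n_wse):
    """Naive linear scaling: total parameters and power for n WSE systems."""
    params = n_wse * PARAMS_PER_WSE
    power_mw = n_wse * KW_PER_WSE / 1000
    return params, power_mw

for n in (1, 2, 4, 100):
    params, mw = cluster_capacity(n)
    print(f"{n:>3} WSE: {params / 1e9:>5.0f}B params, {mw:.2f} MW")
```

This reproduces the comment's numbers (2 machines → 8 billion, 4 → 16 billion, 100 → 400 billion at 1.5 MW), but note it assumes perfectly linear scaling across machines, which real clusters rarely achieve.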
Jensen Huang, Lisa Su and Raja Koduri will all be sweating and very concerned now.
It sounds like you're tunnel-visioned.
You're pointing at one design feature that's theoretically better. But what's the whole picture?