Do they need to be discrete chiplets for MCM, or could they employ a similar approach to Apple's M1 Ultra, effectively dies glued together with interconnects?
There are a variety of ways MCM-type approaches could be used for GPUs. Doing it like CPUs, with discrete chiplets on a traditional bus, is easy to do and scales well for compute tasks, but it's pretty much impossible to overcome the limitations of SLI or CrossFire when it comes to gaming. Something similar could be done with the cores having more complex interconnects and/or sideports for memory access, but so far that only gives a fairly small performance gain, still carries many of the SLI/CF limitations for gaming, and increases complexity a lot.

If the interconnects within the interposer are high enough performance, other possibilities start to open up: moving large areas like some types of cache out of the main package so more of everything else fits in it (which might be what AMD are doing with RDNA3); spreading the functionality of a GPU out over multiple packages, which might involve splitting out the processing and command parts of the GPU; or having "dumb" chiplets that are reprogrammable on the fly (somewhat like Intel's Larrabee), where the workload can be divided out as needed, repurposing the resources of processing blocks as any given workload requires.
Some general information on NVIDIA's MCM-GPU work is here: https://research.nvidia.com/sites/default/files/publications/ISCA_2017_MCMGPU.pdf