Probably due to immature tech. Not that long ago, Bayer sensors sucked when you upped the ISO.
It is more of a design issue than immature tech, although obviously big improvements could still come. One of the design issues is that the concept works well on CCD sensors but not so well on CMOS, and CCD sensors have well-known noise limits (look at all the MF backs).
I have long been interested in alternative sensor technology; I very much liked the Foveon concept and hoped we would see CaNikon try something similar. However, there are very good reasons why the big names have stuck with the Bayer design. The latest-generation Sony and Nikon sensors show how far this design can still be pushed (and yes, Nikon is producing plenty of sensors of its own that are as good as Sony's: the D4, D3 and D3s sensors, for instance; the D5200 and D7100 sensors are Toshiba manufactured, and the D3200 is not a Sony sensor either).
There is no free lunch in sensor design. On the surface Foveon looks like a simple but massive design win: a single pixel of a Bayer sensor throws away roughly 2/3 of the incoming light (assuming uniform white light) before the photons even touch silicon. However, the Foveon sensors are actually much worse in this respect. While every pixel location can record R, G and B wavelengths, only the blue-sensitive layer at the surface operates at anything like full efficiency. The R and G layers receive far fewer photons (red wavelengths do penetrate the silicon substrate deeper than blue, which is the main design principle, but plenty of photons are still lost in the upper layers of the silicon before reaching them).
The Foveon sensors are actually much less efficient than a Bayer sensor, which is why the noise is so high above base sensitivity. The loss of light is especially bad in low-light environments.
One of the reasons Bayer sensors are doing well as pixel densities climb is that they start behaving closer to a Foveon design, but with greater efficiency. If you assume a Foveon sensor measures R, G and B at a given pixel size, a Bayer design with pixels 3 times smaller can record the same R, G and B samples in the same area. Current demosaicing algorithms still produce a final output with those three samples each mapped to the RGB colorspace, but you can just as easily imagine the three measured samples being combined into a single output pixel (the same concept Sigma uses to claim a 45MP sensor while leaving the output at 15MP x 3 colours). We can replicate this by simply downsizing the image, and when you do that you see why a camera like the D800 performs so well, hence the high DxOMark score.
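To make the downsizing idea concrete, here is a minimal numpy sketch (not Sigma's or DxO's actual pipeline; the frame size and noise level are made-up illustration values). It just averages 3x3 blocks of a demosaiced image, which is roughly what downsizing does, and the per-pixel noise drops by about 3x:

import numpy as np

def downsample_rgb(img, factor=3):
    # average non-overlapping factor x factor blocks of an (H, W, 3) image;
    # independent noise averages out, so it falls by ~sqrt(factor^2)
    h, w, c = img.shape
    h, w = h - h % factor, w - w % factor   # crop to a multiple of factor
    blocks = img[:h, :w].reshape(h // factor, factor, w // factor, factor, c)
    return blocks.mean(axis=(1, 3))

rng = np.random.default_rng(0)
frame = 0.5 + rng.normal(0.0, 0.05, size=(600, 900, 3))   # hypothetical flat grey capture with read noise
small = downsample_rgb(frame, factor=3)
print(frame.std(), small.std())   # ~0.05 vs ~0.017, i.e. roughly 3x less noise per pixel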
Moreover, if you understand DSP you will recognise that sampling at at least twice the highest frequency present in the source (the Nyquist criterion) is required to fully reconstruct it. To get the most out of a lens you need to sample at least twice the linear resolution it can project, and then you can downsample to produce the optimal-resolution image. This is exactly what that Nokia smartphone does with its 41MP sensor: it captures at very high resolution, then combines and filters the data in software to produce high-quality 4-6MP output images.
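Here is the same Nyquist point in one dimension, as a toy numpy/scipy sketch (the frequencies and sample counts are arbitrary illustration values, nothing to do with the actual Nokia processing): detail beyond what 48 samples can represent turns into a false pattern if you sample at 48 directly, but is cleanly removed if you oversample and then filter before downsampling.

import numpy as np
from scipy.signal import decimate

x = np.linspace(0, 1, 3840, endpoint=False)
detail = np.sin(2 * np.pi * 30 * x)   # fine detail at 30 cycles, above the 24-cycle Nyquist limit of 48 samples

naive = detail[::80]                  # 48 direct samples: aliases into a false 18-cycle pattern (moire)
oversampled = detail[::10]            # 384 samples, well above Nyquist
filtered = decimate(oversampled, 8)   # anti-alias low-pass filter, then keep every 8th sample -> 48

print(naive.std(), filtered.std())    # ~0.7 of bogus detail vs ~0 (removed rather than misrepresented)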
The downside is that it is harder to maintain the same quantum efficiency at smaller pixel sizes, but so far this has not been a big issue. The D800 is actually a more efficient camera than the D4 (or the 1D X). The D5200's 24MP APS-C sensor is currently around the most efficient I believe, despite having the highest pixel density. Surprisingly, there are compact cameras with similar or higher efficiency. Look at figure 6 in
http://www.luminous-landscape.com/essays/dxomark_sensor_for_benchmarking_cameras2.shtml: the orange line for the latest compact cameras sits above the blue line for FF DSLRs. These are normalised lines, i.e. what you get if you scale a compact sensor up to FF size, and vice versa. From that it is easy to imagine that a latest-gen compact sensor scaled to FF size would give hundreds of MP, yet the IQ of an appropriately resampled image could surpass current FF sensors.
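A quick back-of-the-envelope of that scaling, using nominal sensor dimensions I am assuming for illustration (roughly a 1/1.7" compact chip), not measured figures:

full_frame_area = 36.0 * 24.0   # mm^2, about 864
compact_area = 7.6 * 5.7        # mm^2, about 43 for a 1/1.7" type sensor
compact_mp = 12                 # a typical compact resolution

scaled_mp = compact_mp * full_frame_area / compact_area
print(round(scaled_mp))         # ~240 MP at the compact pixel pitch, i.e. "hundreds of MP"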
Personally, I don't see the Foveon technology going very far, since one can simply make a Bayer sensor at a higher density and downsample to get equal or better results from a simpler and cheaper sensor design.
The bottom line is that at low light levels you are fighting for every single photon, and neither of these technologies is close to fully efficient in that regard; both throw away large amounts of light. One concept that works well and is used in video cameras is to put a prism behind the lens and split the light into R, G and B channels, each directed to its own dedicated sensor. No demosaicing, no interpolation, no colour filtering, no throwing away photons into dead silicon. The design is fairly standard; the issue is that it doesn't really fit within the DSLR format.
Personally, I see the future in using more advanced software to combine multiple captures to increase resolution, dynamic range and noise performance. The basics are already there: smartphones let you capture panoramas with auto stitching, some cameras have multi-shot HDR modes, and some have multi-shot high-ISO noise reduction modes. We all know you can take a panoramic photo from multiple shots (rotating around the lens's nodal point preferably), creating very high MP images. It is common in astrophotography to take multiple exposures and stack them to reduce noise, and I think Sony cameras offer a similar option these days. Another neat trick is to take multiple shots of, say, a famous building that unfortunately has a load of tourists walking all over the front; by combining the images you can remove them from the final output. I have seen this done to capture great photos of places that are normally too busy to shoot without interference. Then there is even the possibility of generating high-res 3D models from multiple stills.
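As a minimal sketch of the stacking idea (frames are assumed to be already aligned, e.g. shot from a tripod; this is not any particular camera's in-body implementation):

import numpy as np

def stack_frames(frames, mode="median"):
    # frames: list of aligned (H, W, 3) arrays of the same scene
    # mode="mean":   average N frames, random noise drops by ~sqrt(N) (astro / multi-shot NR)
    # mode="median": per-pixel median, so anything present in only a minority of
    #                frames (a passing tourist) disappears from the result
    stack = np.stack(frames, axis=0).astype(np.float64)
    return stack.mean(axis=0) if mode == "mean" else np.median(stack, axis=0)

# usage sketch:
# clean_plaza = stack_frames(aligned_frames, mode="median")
# low_noise = stack_frames(aligned_frames, mode="mean")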
What we have seen so far in consumer cameras is very basic compared to what the latest research is producing (I used to work closely with people researching computer vision techniques), and the rate of progress is really rapid.