It's not stitching images together; it's combining the information captured from different sensors into a single image with higher resolution, higher dynamic range, lower noise, and greater DoF than any single sensor can achieve, while also capturing additional information such as image depth, which allows post-processing effects like choosing the focus point and effective aperture after the fact.
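As a toy illustration of the noise side of that fusion, here's a minimal sketch, assuming perfectly aligned, identical sensors (a real pipeline would first have to register and warp each module's image, and would weight the merge rather than just average):

```python
import numpy as np

rng = np.random.default_rng(0)
scene = rng.uniform(0.2, 0.8, size=(100, 100))  # "true" scene luminance

def capture(scene, read_noise=0.05):
    """One noisy capture from one sensor (toy noise model)."""
    return scene + rng.normal(0.0, read_noise, scene.shape)

# Fuse N aligned captures by simple averaging: noise drops ~1/sqrt(N).
N = 8
fused = np.mean([capture(scene) for _ in range(N)], axis=0)

print(f"single-sensor noise: {np.std(capture(scene) - scene):.4f}")
print(f"fused ({N} sensors): {np.std(fused - scene):.4f}")
```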
To get an equivalent sensor size you need to combine the areas of the different sensors used, in which case you get an area bigger than the 1" RX100's.
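A quick back-of-the-envelope calculation shows how the areas add up (the module size and count here are illustrative assumptions, not the camera's actual specs):

```python
# 1" sensor (as in the RX100) is 13.2 x 8.8 mm.
one_inch = 13.2 * 8.8               # ~116 mm^2
small_module = 4.8 * 3.6            # a hypothetical 1/3"-class module, ~17 mm^2
n_modules = 10                      # hypothetical module count
combined = n_modules * small_module

print(f'1" sensor:        {one_inch:.0f} mm^2')
print(f"combined modules: {combined:.0f} mm^2")  # ~173 mm^2, already larger
```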
As for its effective aperture, it has essentially infinite DoF, which is what makes the whole system work: it allows you to choose the focus point after the image is taken. Shallow DoF is then produced in software, but this is very different from the fake tilt-shift rubbish produced by people who don't understand TS lenses. Accurate simulation of out-of-focus rendering IS possible when you have depth information and high enough dynamic range, both of which a multi-sensor camera can deliver. Once you know the depth at each pixel, you can simulate any DoF rendering you want with a model of the lens's point spread function (which is easily measured).
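Here's a minimal sketch of that idea for a grayscale image: blur radius grows with distance from a chosen focus plane, mimicking a circle-of-confusion model. The uniform box filter is a crude stand-in for a measured lens PSF, and the function name and parameters are mine, not from any real product:

```python
import numpy as np
from scipy.ndimage import uniform_filter  # crude stand-in for a real PSF

def synthetic_dof(image, depth, focus_depth, strength=8.0):
    """Re-render DoF from an all-in-focus image plus a per-pixel depth map.

    Blur radius at each pixel scales with its distance from the chosen
    focus plane. A measured PSF would replace the box filter in practice.
    """
    out = np.zeros_like(image)
    coc = strength * np.abs(depth - focus_depth)  # blur radius per pixel
    # Quantize blur radii and composite pre-blurred layers (a common trick).
    for r in range(0, int(coc.max()) + 1):
        blurred = uniform_filter(image, size=2 * r + 1)
        mask = (coc >= r) & (coc < r + 1)
        out[mask] = blurred[mask]
    return out
```

Changing `focus_depth` after capture re-renders the shot focused on a different plane, which is exactly the "choose the focus point later" ability described above.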
It is definitely the future, but we have a long way to go. It will also only really make sense in very high-end cameras, because you inherently need more sensors, more lenses, more complexity, and more cost. For things like photojournalism, weddings, events, and sports this technology will be huge: never miss a shot, never have failed focus, unlimited artistic potential.
Multi-sensor imaging is already the norm in areas like video, where the red, green, and blue channels are split onto different sensors. With our ancient single-sensor DSLRs we throw away two-thirds of the light coming into the lens!
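To see where the two-thirds figure comes from, here's a toy simulation of a Bayer colour filter array, where each photosite records only one of the three channels (a rough sketch; real CFA dyes pass somewhat more or less than an exact third):

```python
import numpy as np

rgb = np.random.default_rng(1).uniform(size=(4, 4, 3))  # toy RGB scene

# Standard RGGB Bayer pattern: one channel survives per photosite.
bayer = np.zeros(rgb.shape[:2])
bayer[0::2, 0::2] = rgb[0::2, 0::2, 1]  # G
bayer[0::2, 1::2] = rgb[0::2, 1::2, 0]  # R
bayer[1::2, 0::2] = rgb[1::2, 0::2, 2]  # B
bayer[1::2, 1::2] = rgb[1::2, 1::2, 1]  # G

print(f"fraction of light recorded: {bayer.sum() / rgb.sum():.2f}")  # ~0.33
```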