It is possible, but RNNs are mostly suited to predicting time series, while image upscaling is convolutional in nature. The same underlying dynamics that make CNNs the state of the art in image recognition also make them the state of the art in image reconstruction (essentially, the universe we live in is highly structured, and NNs can learn the statistical relationships in that structure).
The temporal accumulation is there to increase the amount of information available in the spatial domain before spatial upscaling via CNNs. One simplistic way to see what happens: if on every even frame you render the even pixel rows, and on every odd frame the odd rows, then by accumulating 2 frames you have complete information - under the caveat that nothing moves. When there is motion, you have to use motion vectors to predict the displacement. This doesn't require ML, just straight algebra. Do this well and you get good results with a few motion-related problems, as seen in UE5 TSR. A minimal sketch of the idea follows below.
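Here is a toy NumPy sketch of that even/odd accumulation with motion-vector reprojection. It is an illustration of the general idea only (the frame layout, array shapes, and nearest-neighbour reprojection are my assumptions), not how UE5 TSR or any real renderer implements it:

```python
import numpy as np

def reproject(prev_frame, motion_vectors):
    """Shift each pixel of the previous frame by its motion vector (nearest-neighbour).
    motion_vectors[y, x] = (dy, dx) in pixels; plain algebra, no ML involved."""
    h, w = prev_frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(ys - motion_vectors[..., 0].round().astype(int), 0, h - 1)
    src_x = np.clip(xs - motion_vectors[..., 1].round().astype(int), 0, w - 1)
    return prev_frame[src_y, src_x]

def accumulate(prev_frame, curr_half, frame_index, motion_vectors):
    """curr_half holds only the rows rendered this frame (even rows on even frames,
    odd rows on odd frames); the remaining rows come from the reprojected history."""
    merged = reproject(prev_frame, motion_vectors)
    rows = slice(0, None, 2) if frame_index % 2 == 0 else slice(1, None, 2)
    merged[rows] = curr_half  # overwrite with the freshly rendered rows
    return merged
```

With zero motion vectors this degenerates to simple interleaving of two half-renders into one complete frame, which is the "nothing moves" case described above.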
What DLSS does is combine the temporally accumulated and reprojected information with additional spatial data such as the depth buffer, as well as the original motion vectors, and feed it all into a large CNN. This CNN not only applies state-of-the-art spatial upscaling, but does so on an image that is already far more detailed than the rendered frame alone. More so, the CNN is trained not only to upscale a clean image, but also to correct some of the motion artifacts that temporal accumulation & reprojection introduce. The model learns motion-related distortion from the accumulated spatial image and the associated motion vectors, effectively learning a distortion function. The depth buffer and other spatial data are used by the model to better enhance edges as a form of morphological AA, but without an explicit AA algorithm. The actual implementation may well consist of several DL models: it could easily be one model that corrects for motion distortion, with its output fed into a more standard image reconstruction network. A rough sketch of how such inputs might be assembled follows below.
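To make the "stack everything as channels and let a CNN reconstruct" idea concrete, here is a hypothetical PyTorch sketch. This is not NVIDIA's actual DLSS architecture; the network shape, channel layout, and `ReconstructionNet` name are all assumptions for illustration:

```python
import torch
import torch.nn as nn

class ReconstructionNet(nn.Module):
    """Toy reconstruction CNN: accumulated colour + depth + motion vectors in,
    a higher-resolution colour frame out."""
    def __init__(self, scale=2):
        super().__init__()
        # Input channels: 3 (accumulated colour) + 1 (depth) + 2 (motion vectors) = 6
        self.body = nn.Sequential(
            nn.Conv2d(6, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 3 * scale * scale, 3, padding=1),
        )
        self.upscale = nn.PixelShuffle(scale)  # rearrange channels into a larger image

    def forward(self, color, depth, motion):
        x = torch.cat([color, depth, motion], dim=1)  # stack all guidance as channels
        return self.upscale(self.body(x))

# Example: a 540p accumulated frame upscaled to 1080p.
net = ReconstructionNet(scale=2)
color = torch.rand(1, 3, 540, 960)    # accumulated + reprojected colour
depth = torch.rand(1, 1, 540, 960)    # depth buffer
motion = torch.rand(1, 2, 540, 960)   # per-pixel motion vectors
out = net(color, depth, motion)       # -> shape (1, 3, 1080, 1920)
```

In a multi-model setup as speculated above, one such network could first clean up motion distortion at the input resolution and a second one perform the actual upscaling.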