Frame Buffers
The word "buffer" here refers to the memory used by the GPU to store the frame being rendered. There must be one, else no frame can be generated. The monitor (well, RAMDAC or digital interface) has to read from this buffer to show the frame on the screen. It is possible to run with a single buffer, but this is not done anymore. Imagine slowing the GPU down tremendously... you would see every graphics operation being drawn in real-time on the screen. You don't want this. So, at the very least, Double Buffering is used. Here, the GPU can prepare a frame in one buffer (A), whist the monitor is allowed to read from the other, completed frame (held in the other buffer, B). Only when the GPU has finished a frame completely is the monitor allowed to switch (known as a "flip") to the first buffer (A) and start to show the new frame. The GPU also does a buffer flip and starts rendering the next frame in the second buffer (B). Once another frame is completed they both flip back and start again.
Double buffering with VSync causes a problem: if the FPS falls below the monitor refresh rate, even by 1 FPS (e.g. 59 FPS on a 60Hz monitor), the framerate falls to half the refresh rate (e.g. 30 FPS). This is because the GPU only has one "working" buffer with double buffering; the monitor must be given exclusive access to the other for its reads, to prevent partially rendered frames being drawn to the screen. The GPU stalls because it has to wait for the monitor to finish drawing the same frame a second time, all because it was a fraction of a refresh too slow to draw the next frame... result: the framerate halves, or worse as the FPS drops further (e.g. a GPU capable of 29 FPS delivers only 20).
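The arithmetic behind that halving, as a tiny C program (the render times are just illustrative figures):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    const double refresh_ms = 1000.0 / 60.0;   /* one vsync every 16.67 ms */
    double render_ms[] = { 15.0, 17.0, 34.5 }; /* sample GPU frame times   */

    for (int i = 0; i < 3; i++) {
        /* With double buffering + VSync the GPU must stall until the
           next vsync after it finishes, so the effective frame time is
           the render time rounded UP to whole refresh intervals. */
        double effective = ceil(render_ms[i] / refresh_ms) * refresh_ms;
        printf("render %.1f ms -> shown every %.2f ms -> %.1f FPS\n",
               render_ms[i], effective, 1000.0 / effective);
    }
    return 0;  /* prints 60.0, 30.0 and 20.0 FPS respectively */
}
```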
Now, Triple Buffering (TB) overcomes this limitation, as the GPU now has two "working" buffers and one more for the monitor. If the FPS falls to 59 there is no problem: the GPU is allowed to flip to its second "working" buffer and prepare a third frame.
The problem with input lag comes down to this...
Let's say that the GPU is so fast it can prepare more than 60 FPS, and we are using VSync + TB. At some point, it WILL finish work on its second "working" buffer whilst waiting for a refresh to finish. It has to. So, we have two frames yet to be shown to the player. Which should we show next? It makes no sense to show the older of the two... it is already out of date. So, we show the newest one. But, if we are going to do that, the GPU could have already started to overwrite the older one (even before the monitor is ready to flip). This is what it does with proper TB, AFAIK.
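A toy model of that "flip to newest, drop the rest" behaviour, tracking frame ids rather than real video memory:

```c
#include <stdio.h>

/* Toy model of "proper" triple buffering. The GPU ping-pongs between
 * two working buffers; at each vsync the NEWEST completed frame is
 * shown and the older one is simply overwritten (i.e. dropped). */
int main(void) {
    int back[2] = { -1, -1 };  /* frame id held in each working buffer */
    int target  = 0;           /* buffer the GPU renders into next     */
    int next    = 0;           /* id of the next frame to render       */

    for (int vsync = 0; vsync < 4; vsync++) {
        /* GPU is faster than the refresh: it completes two frames
           between vsyncs, never stalling. */
        back[target] = next++;
        target ^= 1;
        back[target] = next++;

        /* At vsync, flip to the newest completed frame. The slot that
           went to front then stands for the buffer the monitor just
           released, so the GPU always has two buffers to work with. */
        int newest = (back[0] > back[1]) ? 0 : 1;
        printf("vsync %d: show frame %d, drop frame %d\n",
               vsync, back[newest], back[newest ^ 1]);
        target = newest ^ 1;  /* keep rendering into the free buffer */
    }
    return 0;
}
```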
Unfortunately, from what I have read elsewhere, the D3D version of TB is not, IMHO, correct. It does have (at least) two working buffers, but the frames are shown to the player IN SEQUENCE and none are ever thrown away. True, the game might like to know that a frame has been thrown away, in case it wanted to ensure that the player saw something, but unless it is a subliminal message, one frame long, I see no need for this. Anyway, because they assume that we should not throw frames away, this is what they do. The result: input lag... constant and without remorse. Every frame from then on will be delayed, unless the framerate drops below the refresh rate. Of course, if it climbs again, the lag will return.
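Modelled the same way, that in-sequence behaviour looks like the sketch below (QUEUE_LEN is my label for however many completed frames are allowed to pile up; the model compresses rendering into instants at each vsync tick):

```c
#include <stdio.h>

#define QUEUE_LEN 2  /* completed frames allowed to pile up */

int main(void) {
    int frame_id[QUEUE_LEN], rendered_at[QUEUE_LEN];
    int head = 0, count = 0, next = 0;

    for (int vsync = 0; vsync < 6; vsync++) {
        /* A fast GPU fills the queue, then stalls: nothing is ever
           thrown away, so it cannot race any further ahead. */
        while (count < QUEUE_LEN) {
            int slot = (head + count) % QUEUE_LEN;
            frame_id[slot]    = next++;
            rendered_at[slot] = vsync;
            count++;
        }

        /* The monitor is given the OLDEST frame, strictly in order. */
        printf("vsync %d: frame %d, finished %d refresh(es) ago\n",
               vsync, frame_id[head], vsync - rendered_at[head]);
        head = (head + 1) % QUEUE_LEN;
        count--;
    }
    return 0;  /* settles at a constant one-refresh delay; deeper queues delay more */
}
```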
Now, impose an FPS limit below the refresh rate (ideally EXACTLY on it, but I would think that is impossible in practice). Now we ensure that the GPU cannot race ahead, and there is no input lag. We still need the three buffers, to prevent the halving of FPS caused by only two, but we prevent the lag caused by too many completed frames that insist on being drawn in sequence. As a bonus, we save power/heat, because we let the card idle whenever the framerate would otherwise climb above the refresh rate.
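A minimal sketch of such a limiter, assuming POSIX timing calls (clock_gettime/nanosleep); a real game would hook this into its main loop and probably use its engine's own timer:

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

/* Sleep just long enough that frames never come faster than target_fps. */
static void limit_fps(double target_fps, struct timespec *last) {
    const long frame_ns = (long)(1e9 / target_fps);
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);

    long elapsed = (now.tv_sec - last->tv_sec) * 1000000000L
                 + (now.tv_nsec - last->tv_nsec);
    if (elapsed < frame_ns) {
        struct timespec pause = { 0, frame_ns - elapsed };
        nanosleep(&pause, NULL);  /* the card idles here */
    }
    clock_gettime(CLOCK_MONOTONIC, last);
}

int main(void) {
    struct timespec last;
    clock_gettime(CLOCK_MONOTONIC, &last);

    for (int frame = 0; frame < 5; frame++) {
        /* ... render the frame here ... */
        limit_fps(59.0, &last);  /* cap just below a 60Hz refresh */
        printf("frame %d presented\n", frame);
    }
    return 0;
}
```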
However, we do get some frames drawn multiple times, but this is the same as you would get with a game that could only generate, say, 58 FPS where we decided to use VSync. There will always be some "skipped frames". Do you notice this? I think you do, but it is the price you pay for VSync and no tearing.
The ideal solution? Monitors that allow a dynamic refresh rate, up to some limit (e.g. 60 or 120Hz), letting the time between frames vary... meaning that an FPS under the maximum refresh rate does not suffer any frame skipping, and each frame has identical output lag through the interface to the monitor. Given that LCDs don't need "refreshing" in the same way as CRTs, I'm amazed this hasn't been done already. There will be some additional input lag caused by longer GPU rendering times, but the result would be much smoother than "frame skipping" / duplicate refreshes.
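As a sketch, the presentation rule I'm imagining is simply "scan out when the frame is ready, but no sooner than the panel's minimum interval"; this models the idea, not any real interface, and the numbers are made up:

```c
#include <stdio.h>

int main(void) {
    const double min_interval_ms = 1000.0 / 120.0;  /* 120Hz upper limit */
    double render_ms[] = { 10.0, 14.0, 7.0, 21.0 }; /* varying GPU times */

    double now = 0.0, last_scanout = -1e9;
    for (int i = 0; i < 4; i++) {
        now += render_ms[i];       /* frame i finishes rendering here */

        /* Scan out immediately, unless that would exceed the panel's
           maximum refresh rate; every frame is shown exactly once. */
        double scanout = now;
        if (scanout - last_scanout < min_interval_ms)
            scanout = last_scanout + min_interval_ms;

        printf("frame %d: ready at %5.1f ms, scanned out at %6.2f ms\n",
               i, now, scanout);
        last_scanout = scanout;
    }
    return 0;
}
```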
Martin