As mentioned, the L3 cache is just one more stop-gap before having to resort to RAM.
When more cores work on a program concurrently, they chew up the available cache more quickly, since each core needs its own copy of the data it touches in its private L1 and L2. The L3 doesn't provide a new memory access paradigm to overcome this; it just provides a quicker buffer than RAM for shuffling things in and out of the per-core L1 and L2 caches.
As far as I've understood it, the responsibility on programmers is the same (for now) whether or not there's an L3 cache, and whatever size it is: they need to break up their threads' work to minimize cache misses, and also to avoid hits on data that sits in a cache line where other cores are modifying neighbouring addresses. That situation is called false sharing, and Intel have compiler profiling/tuning tools to help spot when it happens.
So an app which processes an image by handing each pixel to each thread in turn, like 12341234123412341234, will quickly bog down and run badly, as (say) each 8-byte block ends up being read and written concurrently by multiple cores, and each CPU has to keep refreshing its own cached copy of the block before proceeding.
So long as the programmers break it up so that the pixels are handled by threads like this, the cache overhead of dividing a piece of work up between them is mitigated:
111111111111111111111111111111111111111....11111222222222222222222222222222...222222233333333333333333333333333333333333333....33333344444444444444444444444444444444444...444444
Thread 1 can now zip through all of its addresses in each cache line, and the other cores won't interfere with its updates to those lines.
Do it like this, and you may not even consume the bandwidth of your cache to the point of exhaustion.
Personally, for future scalability, I would think a transactional memory model would be wonderful: the policy used in databases, particularly the ability to have many readers see a consistent memory state while uncommitted writes stay isolated until a batch of changes is ready to commit. But the existing implementations seem limited to proprietary systems and research for now; not much of this can be used in consumer systems, it seems.