As mentioned, the L3 cache is just one more stop-gap before having to resort to RAM.
When more cores work on a program concurrently, they chew up the available cache more quickly, since each core needs its own copy of the data it touches in its private L1 and L2. The L3 doesn't provide a new memory access paradigm to overcome this; it just provides a quicker buffer than RAM for shuffling things in and out of the per-core L1 and L2 caches.
As far as I've understood it, the responsibility on programmers is the same (for now) whether or not there's an L3 cache, and whatever size it is: they need to break up their threads' work to minimize cache misses, and also to avoid hits on data that sits in a cache line where other cores are modifying neighbouring addresses. That situation is called false sharing, and Intel have compiler profiling/tuning tools to help spot when it happens.
So an app which processes an image by handing each pixel to each thread in turn, like 12341234123412341234, will quickly bog down and run badly, as (say) each 8-byte block ends up being read and written concurrently by multiple cores, and each CPU has to keep refreshing its own cached copy of the block before proceeding.
So long as the programmers break it up so that the pixels are handled by threads like this, the cache overhead of dividing a piece of work up between them is mitigated:
111111111111111111111111111111111111111....11111222222222222222222222222222...222222233333333333333333333333333333333333333....33333344444444444444444444444444444444444...444444
Thread 1 can now zip through all of its addresses in each cache line, and the other cores won't interfere with its updates to those lines.
Do it like this, and you may not even consume the bandwidth of your cache to the point of exhaustion.
Personally, for future scalability, I would think a transactional memory model would be wonderful: the policy used in databases, particularly the ability to have many readers see a consistent memory state while uncommitted writes stay isolated until a batch of changes is ready to commit. But the existing implementations seem limited to proprietary systems and research for now; not much of this can be used in consumer systems, it seems.