Friday 16 March 2012

Understanding CPU caching and performance

Block sizes

In the section on spatial locality I mentioned that storing whole blocks is one way that caches take advantage of spatial locality of reference. Now that we know a little more about how caches are organized internally, we can look a bit closer at the issue of block size. You might think that as cache sizes increase you could take even better advantage of spatial locality by making block sizes even bigger. Surely fetching more bytes per block into the cache would decrease the odds that some part of the working set will be evicted because it resides in a different block. This is true, to some extent, but we have to be careful. If we increase the block size while keeping the cache size the same, then we decrease the number of blocks that the cache can hold. Fewer blocks in the cache means fewer sets, and fewer sets means that collisions and therefore misses are more likely. And of course, with fewer blocks in the cache the likelihood that any particular block that the CPU needs will be available in the cache decreases.
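To make that arithmetic concrete, here's a minimal sketch (the 32 KB, four-way figures are hypothetical numbers chosen for illustration, not taken from any particular processor) showing how the block count and set count shrink as the block size grows while the total cache size stays fixed.

```c
#include <stdio.h>

/* Hypothetical example: a 32 KB, 4-way set-associative cache.
   Doubling the block size halves both the number of blocks the
   cache can hold and the number of sets it is divided into. */
int main(void)
{
    const int cache_size = 32 * 1024;          /* total cache size in bytes */
    const int ways       = 4;                  /* blocks per set            */
    const int block_sizes[] = { 32, 64, 128 }; /* block sizes to compare    */

    for (int i = 0; i < 3; i++) {
        int block_size = block_sizes[i];
        int num_blocks = cache_size / block_size;
        int num_sets   = num_blocks / ways;
        printf("block size %3d B -> %4d blocks, %3d sets\n",
               block_size, num_blocks, num_sets);
    }
    return 0;
}
```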
The upshot of all this is that smaller block sizes allow us to exercise more fine-grained control of the cache. We can trace out the boundaries of a working set with a higher resolution by using smaller cache blocks. If our cache blocks are too large, we wind up with a lot of wasted cache space because many of the blocks will contain only a few bytes from the working set while the rest is irrelevant junk. If we think of this issue in terms of cache pollution, we can say that large cache blocks are more prone to pollute the cache with non-reusable data than small cache blocks.
The following image shows the memory map we've been using, with large block sizes.
This next image shows the same map, but with the block sizes decreased. Notice how much more control the smaller blocks allow over cache pollution.
The other problems with large block sizes are bandwidth-related. The larger the block size, the more data is fetched with each LOAD, so large block sizes can really eat up memory bus bandwidth, especially if the miss rate is high. A system therefore has to have plenty of bandwidth if it's going to make good use of large cache blocks. Otherwise, the increase in bus traffic can increase the amount of time it takes to fetch a cache block from memory, thereby adding latency to the cache.
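As a rough illustration of that bandwidth cost, the sketch below multiplies a made-up access rate and miss rate by the block size to estimate how much fill traffic the memory bus has to carry; every number here is an assumption for illustration only.

```c
#include <stdio.h>

/* Hypothetical back-of-the-envelope figures (not from the article):
   estimate the memory-bus traffic generated by cache fills as
   misses-per-second times block size. */
int main(void)
{
    const double accesses_per_sec = 100e6;     /* memory accesses per second */
    const double miss_rate        = 0.05;      /* 5% of accesses miss        */
    const int block_sizes[] = { 32, 64, 128 }; /* block sizes to compare     */

    for (int i = 0; i < 3; i++) {
        double bytes_per_sec = accesses_per_sec * miss_rate * block_sizes[i];
        printf("block size %3d B -> %6.1f MB/s of fill traffic\n",
               block_sizes[i], bytes_per_sec / 1e6);
    }
    return 0;
}
```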

Write Policies: Write through vs. Write back

So far, this entire article has dealt with only one type of memory traffic: loads, or requests for data from memory. I've only talked about loads because they make up the vast majority of memory traffic. The remainder of memory traffic is made up of stores, which in simple uniprocessor systems are much easier to deal with. In this section, we'll cover how to handle stores in single-processor systems with just an L1 cache. When you throw in more caches and multiple processors, things get more complicated than I want to go into here.
Once a retrieved piece of data is modified by the CPU, it must be stored or written back out to main memory so that the rest of the system has access to the most up-to-date version of it. There are two ways to deal with such writes to memory. The first way is to immediately update all the copies of the modified data in each level of the hierarchy to reflect the latest changes. So a piece of modified data would be written to both the L1 and main memory so that all of its copies are current. Such a policy for handling writes is called write through, since it writes the modified data through to all levels of the hierarchy.
A write through policy can be nice for multiprocessor and I/O-intensive system designs, since multiple clients are reading from memory at once and all need the most current data available. However, the multiple updates per write required by this policy can greatly increase memory traffic. For each STORE, the system must update multiple copies of the modified data. If a large amount of data has been modified, this can eat up quite a bit of memory bandwidth that could otherwise be used for the more important LOAD traffic.
The alternative to write through is write back, and it can potentially result in less memory traffic. With a write back policy, changes propagate down to the lower levels of the hierarchy only as cache blocks are evicted from the higher levels. So an updated piece of data in an L1 cache block will not be written to main memory until that block is evicted from the L1.
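To see how the two policies differ in the amount of memory traffic they generate, here's a minimal sketch of a toy, single-block cache (a deliberate oversimplification of my own, not a model of any real design) that simply counts how many times main memory gets written under each policy for the same sequence of stores.

```c
#include <stdio.h>
#include <stdbool.h>

/* Toy single-block "cache" used only to count writes to main memory
   under the two policies described above. The access pattern and all
   numbers are made up for illustration. */
typedef struct {
    int  tag;        /* which memory block is cached (-1 = empty)      */
    bool dirty;      /* write-back only: modified since it was filled? */
    long mem_writes; /* how many times main memory has been written    */
} Cache;

static void store(Cache *c, int block, bool write_back)
{
    if (c->tag != block) {             /* miss: evict the old block    */
        if (write_back && c->dirty)
            c->mem_writes++;           /* write-back: flush on evict   */
        c->tag   = block;
        c->dirty = false;
    }
    if (write_back)
        c->dirty = true;               /* defer the memory update      */
    else
        c->mem_writes++;               /* write-through: update now    */
}

int main(void)
{
    int pattern[] = { 1, 1, 1, 1, 2, 2, 2, 2 };  /* repeated stores    */
    Cache wt = { -1, false, 0 }, wb = { -1, false, 0 };

    for (int i = 0; i < 8; i++) {
        store(&wt, pattern[i], false);
        store(&wb, pattern[i], true);
    }
    printf("write-through: %ld memory writes\n", wt.mem_writes);  /* 8 */
    printf("write-back:    %ld memory writes\n", wb.mem_writes);  /* 1 */
    return 0;
}
```

Note that when the toy program ends, the write back cache still holds a dirty block that has never reached main memory; in a real system it would be flushed whenever that block was eventually evicted, which is exactly the behavior described above.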

Conclusions

There is much, much more that can be said about caching, and this article has covered only the basic concepts. In the next article, we'll look in detail at the caching and memory systems of both the P4 and the G4e. This will provide an opportunity not only to fill in the preceding, general discussion with some real-world specifics, but also to introduce some more advanced caching concepts like data prefetching and cache coherency.

