Microprocessor Design/Cache

Cache

Cache is a small amount of memory, usually located on the die with the processor, which operates more quickly than external RAM. Data is moved from main memory to the cache so that it can be accessed faster. Modern chip designs allocate a large amount of die area to cache, and increasing chip performance is typically achieved by increasing the speed and efficiency of the cache.

Cache works by storing a small subset of the external memory contents, typically out of its original order. Data and instructions that are used frequently, such as a data array or a small instruction loop, are stored in the cache and can be read quickly without accessing main memory. Cache runs at the same speed as the rest of the processor, which is typically much faster than the external RAM. This means that if data is in the cache, accessing it is faster than accessing memory.

Cache helps to speed up processors because it exploits the principle of locality.

Principle of Locality

There are two types of locality: spatial and temporal. Modern computer programs are typically loop-based, which leads to two rules about locality:

Spatial Locality
When a data item is accessed, it is likely that data items in sequential memory locations will also be accessed. Consider the traversal of an array, or the act of storing local variables on a stack. In these cases, when one data item is accessed, it is a good idea to load the surrounding memory area into the cache at the same time.
Temporal Locality
When a data item is accessed, it is likely that the same data item will be accessed again. For instance, variables are typically read and written in rapid succession. It is a good idea to keep recently used items in the cache, and not to overwrite data that has been recently used.
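
To make the two rules concrete, consider the small C function below (a minimal sketch written for this page; the function and variable names are illustrative):

    /* Summing an array exhibits both kinds of locality:
       a[i] touches consecutive addresses (spatial locality),
       while sum and i are reused on every iteration
       (temporal locality). */
    int sum_array(const int *a, int n)
    {
        int sum = 0;                  /* reused every iteration: temporal */
        for (int i = 0; i < n; i++)   /* i reused every iteration: temporal */
            sum += a[i];              /* sequential accesses: spatial */
        return sum;
    }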

Hit or Miss

A cache hit occurs when the processor finds the data it is looking for in the cache. A miss occurs when the processor looks for data in the cache but the data is not there. In the event of a miss, the cache controller must fetch the data from main memory, which costs the processor additional time.

Flushing the Cache

When the processor needs data, it looks in the cache. If the data is not in the cache, the processor fetches it from memory; the data is moved into the cache and then used. Sometimes the entire cache contains useless or old data, and it needs to be flushed. Flushing occurs when the cache controller determines that the cache contains more potential misses than hits. Flushing the cache takes several processor cycles, so much research has gone into developing algorithms to keep the cache up to date.

Cache Hierarchy

Cache is typically divided into multiple levels. The most common levels are L1, L2, and L3: L1 is the smallest but the fastest, and L3 is the largest but the slowest. Many chips do not have L3 cache, and some chips that do have an external L3 module that sits on the motherboard between the microprocessor and the RAM.

Size of Cache

The Pentium Pro chip was one of the largest microprocessors ever manufactured. It was so large because it contained the largest cache of any chip at the time.

There are a number of factors that affect the size of cache on a chip:

  1. Moore's law typically provides more transistors on a chip than a designer can use, and these extra transistors are easily converted into large caches.
  2. Processor components become smaller as transistors become smaller. This means there is more area on the die for additional cache.
  3. More cache means fewer delays in accessing data, and therefore better performance.

Because of these factors, chip caches tend to get larger and larger with each generation of chip.

Cache Tagging

Cache can contain non-sequential data items in no particular order. A block of memory in the cache might be empty and contain no data at all. In order for hardware to check the validity of entries in the cache, every cache entry needs to maintain the following pieces of information:

  1. A status bit to determine if the block is empty or full
  2. The memory address of the data in the block
  3. The data from the specified memory address

When the processor looks for data in the cache, it sends a memory address to the cache controller. The cache controller checks the address against all the address fields in the cache. If there is a hit, the cache controller returns the data. If there is a miss, the cache controller must pass the request to the next level of cache or to the main memory unit.

A diagram of cache showing non-sequential data

The memory address of the data in the cache is known as the tag.
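
Putting the three fields together, a cache entry and the controller's lookup can be sketched in C as follows (the field types, names, and 8-block size are illustrative assumptions, not a real controller design):

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_BLOCKS 8

    struct cache_entry {
        bool     valid;   /* status bit: is the block full or empty? */
        uint32_t tag;     /* memory address of the data in the block */
        uint32_t data;    /* the data from that memory address */
    };

    /* Returns true on a hit and writes the data to *out;
       returns false on a miss, in which case the request must
       go to the next level of cache or to main memory. */
    bool cache_lookup(const struct cache_entry cache[NUM_BLOCKS],
                      uint32_t address, uint32_t *out)
    {
        for (int i = 0; i < NUM_BLOCKS; i++) {
            if (cache[i].valid && cache[i].tag == address) {
                *out = cache[i].data;   /* hit */
                return true;
            }
        }
        return false;                   /* miss */
    }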

Cache Writing

Write operations take as long to perform in main memory as read operations. Many cached processors therefore perform write operations on the cache as well as read operations. When data is written, a write request is sent simultaneously to the main memory and to the cache; this way, the new data is available in the cache before the slower main-memory write completes. When writing to the cache, it is important to make sure the main memory and the cache are synchronized so that they contain the same data.

Memory Stall Cycles

If the cache misses, the processor must stall the current instruction until the cache can fetch the correct data from a higher level. The amount of time lost to the stall depends on a number of factors. The number of memory accesses in a particular program is denoted A_m. The miss rate (the probability that any particular access will miss) is denoted r_m. The amount of time lost on each miss is known as the miss penalty, denoted P_m. We can calculate the amount of time wasted because of cache miss stalls as:

stall time = A_m × r_m × P_m

Likewise, if we have the total number of instructions in a program, N, and the average number of misses per instruction, MPI, we can calculate the lost time as:

stall time = N × MPI × P_m

If we instead measure the miss penalty in lost cycles rather than lost time, the same calculations produce the number of cycles lost to memory stalls instead of the amount of time lost.
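
As a minimal sketch, the two formulas above translate directly into code (all parameter names are illustrative):

    /* stall time = A_m × r_m × P_m */
    double stall_time(double accesses,      /* A_m: memory accesses        */
                      double miss_rate,     /* r_m: miss probability       */
                      double miss_penalty)  /* P_m: time or cycles per miss */
    {
        return accesses * miss_rate * miss_penalty;
    }

    /* stall time = N × MPI × P_m */
    double stall_time_per_instr(double n_instructions,   /* N   */
                                double misses_per_instr, /* MPI */
                                double miss_penalty)     /* P_m */
    {
        return n_instructions * misses_per_instr * miss_penalty;
    }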

Read Stall Times

To calculate the amount of time lost to cache read misses, we can perform the same basic calculations as above:

read-stall time = A_r × r_r × P_r

A_r is the average number of read accesses, r_r is the miss rate on reads, and P_r is the time or cycle penalty associated with a read miss.

Write Stall Times

Determining the amount of time lost to write stalls is similar, but an additional additive term that represents stalls in the write buffer needs to be included:

write-stall time = A_w × r_w × P_w + T_wb

Where T_wb is the amount of time lost because of stalls in the write buffer. The write buffer can stall when the cache attempts to synchronize with main memory.
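
Combining the read and write formulas gives the total stall time. A sketch under the same illustrative naming:

    /* Total stall time: read stalls plus write stalls,
       including the write-buffer term T_wb. */
    double total_stall_time(double reads,  double read_miss_rate,  double read_penalty,
                            double writes, double write_miss_rate, double write_penalty,
                            double write_buffer_stalls /* T_wb */)
    {
        double read_stall  = reads  * read_miss_rate  * read_penalty;
        double write_stall = writes * write_miss_rate * write_penalty
                             + write_buffer_stalls;
        return read_stall + write_stall;
    }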

Hierarchy Stall Times

In a hierarchical cache system, miss penalties are compounded when data misses in multiple levels of cache. If data is missed in the L1 cache, it will be looked for in the L2 cache; if it also misses in the L2, there is a double penalty: the L2 must load the data from main memory (or the L3 cache, if the system has one), and then the data must be loaded into the L1. Notice that missing in two cache levels and then having to access main memory takes longer than accessing memory directly.
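
One common way to quantify this compounding is the average memory access time, where each level's miss rate multiplies the penalty of the level below it. A minimal sketch for a two-level hierarchy, with illustrative parameter names:

    /* Average memory access time for a two-level hierarchy:
       every access pays the L1 hit time; an L1 miss additionally
       pays the L2 access, and an L2 miss on top of that pays the
       main-memory access. */
    double amat_two_level(double l1_hit_time, double l1_miss_rate,
                          double l2_hit_time, double l2_miss_rate,
                          double mem_access_time)
    {
        return l1_hit_time +
               l1_miss_rate * (l2_hit_time +
                               l2_miss_rate * mem_access_time);
    }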

Design Considerations

L1 cache is typically designed to minimize the time it takes to service a hit. If hit times are sufficiently fast, a sizable miss rate can be accepted, because misses in the L1 are redirected to the L2, which is still significantly faster than accessing main memory. L1 caches tend to have smaller block sizes, but benefit from having more available blocks for the same amount of space. To keep L1 hit times minimal, L1 caches are typically direct-mapped or only narrowly (e.g. 2-way) set associative.

L2 cache, on the other hand, needs a lower miss rate to help avoid accesses to main memory. Accesses to L2 cache are much faster than accesses to memory, so we should do everything possible to maximize the hit rate. For this reason, L2 caches tend to be fully associative with large block sizes. Memory is typically read and written in runs of sequential cells, and large block sizes take advantage of that sequentiality.

L3 cache further continues this trend, with larger block sizes and a minimized miss rate.

Associativity

In order to increase the read speed in a cache, many cache designers implement some level of associativity. An associative cache creates a relationship between the original memory location and the location in the cache where that data is stored. The relationship between the address in main memory and the location where the data is stored is known as the mapping of the cache. In this way, if the data exists in the cache at all, the cache controller knows that it can only be in certain locations that satisfy the mapping.

Direct-Mapped

A direct-mapped system uses a hashing algorithm to assign an identifier to a memory address. A common hashing algorithm for this purpose is the modulo operation, which divides the address by a certain number, p, and takes the remainder r as the result. If a is the main memory address and n is an arbitrary non-negative integer, then the hashing algorithm must satisfy the following equation (with 0 ≤ r < p):

a = p × n + r

If p is chosen properly by the designer, data will be evenly distributed throughout the cache.

In a direct-mapped system, each memory address corresponds to only a single cache location, but a single cache location can correspond to many memory locations. Consider a simple direct-mapped cache with 8 blocks: each memory address a maps to block a mod 8, so addresses 0, 8, and 16 all map to block 0. Cache performance is worst when multiple data items with the same hash value are read, and best when data items are close together in memory (such as a sequential block of program instructions or a sequential array).
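
A quick sketch of the mapping (the 8-block size and the addresses 0, 8, and 16 come from the example above; the other addresses are added for contrast):

    #include <stdio.h>

    /* Direct-mapped placement: the block index is simply
       (address mod 8), so 0, 8, and 16 all land in block 0
       and would evict one another. */
    int main(void)
    {
        const unsigned num_blocks = 8;
        unsigned addresses[] = { 0, 8, 16, 5, 13 };

        for (int i = 0; i < 5; i++)
            printf("address %2u -> block %u\n",
                   addresses[i], addresses[i] % num_blocks);
        return 0;
    }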

2-Way Set Associative

In a 2-way set associative cache system, the memory address is hashed, and each hash value corresponds to a set of two cache blocks. A data value assigned to that set can be inserted into either block. Read speeds remain quick because the cache controller can immediately narrow its search to the set that matches the address hash value.
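
A minimal sketch of such a lookup (the set count, field names, and tag split are illustrative assumptions):

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_SETS 4
    #define WAYS     2

    struct way {
        bool     valid;
        uint32_t tag;
        uint32_t data;
    };

    /* 2-way lookup: hash the address to one set, then check only
       the two ways in that set instead of the whole cache. */
    bool lookup_2way(const struct way cache[NUM_SETS][WAYS],
                     uint32_t address, uint32_t *out)
    {
        uint32_t set = address % NUM_SETS; /* hash: which set       */
        uint32_t tag = address / NUM_SETS; /* remaining address bits */

        for (int w = 0; w < WAYS; w++) {
            if (cache[set][w].valid && cache[set][w].tag == tag) {
                *out = cache[set][w].data;
                return true;
            }
        }
        return false;
    }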

Fully Associative

In a fully-associative cache, no hash algorithm is employed and data can be inserted anywhere in the cache that is available. A typical replacement algorithm writes a new data value over the oldest unused data value in the cache. This scheme, however, requires that the time each item was loaded or last accessed be stored, which can require considerable additional storage.
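
The "oldest unused" bookkeeping described above is commonly implemented as least-recently-used (LRU) replacement. A minimal sketch, assuming each entry stores a timestamp of its last access (the extra storage cost the text mentions):

    #include <stdint.h>

    #define NUM_BLOCKS 8

    struct entry {
        uint32_t tag;
        uint32_t data;
        uint64_t last_used; /* timestamp of the most recent access */
    };

    /* Pick the victim for replacement: the entry whose most
       recent access is oldest. */
    int choose_victim(const struct entry cache[NUM_BLOCKS])
    {
        int victim = 0;
        for (int i = 1; i < NUM_BLOCKS; i++)
            if (cache[i].last_used < cache[victim].last_used)
                victim = i;
        return victim;
    }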

Cache Misses

There are three basic types of misses in a cache:

  1. Conflict Misses
  2. Compulsory Misses
  3. Capacity Misses

Conflict Misses

A conflict miss occurs in a direct-mapped or set-associative cache when two data items are mapped to the same cache location. In a conflict miss, a recently used data item is overwritten by a new data item.

Compulsory Misses

A compulsory miss is an instance where the cache must miss because it does not contain any data. For instance, when a processor is first powered on, there is no valid data in the cache, and the first few reads will always miss.

The compulsory miss demonstrates the need for a cache to differentiate between a space that is empty and one that is full. Consider what happens when we power on the processor and reset all the address values to zero: an attempt to read a memory location with a hash value of zero would hit, even though the block holds no real data. We do not want the cache to hit if the blocks are empty.

Capacity Misses

Capacity misses occur when the cache is not large enough to hold all of the data currently in use, so useful data must be evicted to make room.

Write Policy

Data writes require the same time delay as data reads. For this reason, caching systems typically write data to the cache as well. However, when writing to the cache, it is important to ensure that the data is also written to main memory before the cache entry is overwritten by the next cache read. If data in the cache is overwritten without being stored in main memory, the data is lost.

It is imperative that caches write data to the main memory; exactly when that data is written is governed by the write policy. There are two write policies: write through and write back.

Write Through

In a write through system, data that is written to the cache is immediately written to the main memory as well. If many writes occur in sequential instructions, the write buffer may get backed up and cause a stall.

Write Back

In a write back system, the cache controller keeps track of which data items have been synchronized to main memory. The data items which have not been synchronized are called "dirty", and the cache controller prevents dirty data from being overwritten.

The cache controller will synchronize data during processor cycles where no other data is being written to the cache.
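
The two policies can be contrasted in a short sketch (the names and the mem_write stand-in are illustrative, not a real memory interface):

    #include <stdint.h>
    #include <stdbool.h>

    struct line {
        uint32_t tag;
        uint32_t data;
        bool     dirty; /* write back only: not yet in main memory */
    };

    /* Stand-in for a (slow) main-memory write. */
    static void mem_write(uint32_t address, uint32_t value)
    {
        (void)address; (void)value;
    }

    /* Write through: update the cache and main memory together. */
    void write_through(struct line *l, uint32_t address, uint32_t value)
    {
        l->data = value;
        mem_write(address, value);  /* immediate synchronization */
    }

    /* Write back: update only the cache and mark the line dirty;
       main memory is synchronized later, e.g. during idle cycles
       or when the line is evicted. */
    void write_back(struct line *l, uint32_t value)
    {
        l->data  = value;
        l->dirty = true;
    }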

Stale Data

It is possible for data in main memory to be changed by a component other than the microprocessor. For instance, many computer systems have memory-mapped I/O, or a DMA controller that can alter the data. It is important that the cache controller check that data in the cache is correct. Data in the cache that is old and may be incorrect is called "stale".