Update

See https://github.com/dbaarda/DLFUCache for a sample implementation in Python.

I recently discovered (unsurprisingly) that this idea had already been described as LRFU (Least Recently/Frequently Used). I first found a reference to it in a comparison of cache policies. Most of the documents describing this are hidden behind paywalls, but I did find a PowerPoint presentation and a GitHub example implementation in Python. There also appears to be another cache expiry algorithm called LFRU that merges LRU and LFU in a different way (two-tiered entry classes).

The presentation suggests that the authors didn't realize their decay function was an exponentially decaying count (hinted at by their use of 2 instead of e, and by their explanation of how it can be incrementally calculated instead of just treating that as assumed knowledge), which means they also didn't seem to realize that their CRF (Combined Recency and Frequency) value was the exponentially decaying count of references in the past T timeconstant interval. Knowing this might have given them better insights into tuning their decay as a function of cache size.

Converting from the LRFU decay function to exponential decay gives us:

F(dt) = (1/2)^(L*dt)
      = 2^(-L*dt)
      = e^(-ln(2)*L*dt)
      = e^(-(1/T)*dt)

so:

ln(2)*L = (1/T)
T = 1/(ln(2)*L)

Where:

L is the LRFU "decay constant".
T is the exponentially decaying timeconstant.
dt is the time since the last access or decay evaluation.

Note that they also didn't seem to be aware of the simple approximation for this (provided dt is small relative to T) of:

F(dt) = T / (T + dt)

The value of L~=0.0002 they got for a best performing cache with 2000 entries equates to T~=7000, or ~3.5x the cache size.
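
This conversion is easy to sanity check numerically (a minimal sketch, using the L~=0.0002 value quoted above):

import math

L = 0.0002                  # the LRFU decay constant from their results
T = 1 / (math.log(2) * L)   # equivalent exponential decay timeconstant
print(T)                    # ~7200, or ~3.5x their 2000 entry cache

# The exact decay factor vs the T/(T+dt) approximation for small dt.
dt = 1
print(math.exp(-dt / T))    # ~0.99986
print(T / (T + dt))         # ~0.99986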

They kept both an atime and a decaying access count per entry, but you can do this with just a decaying count per entry if you exponentially grow the increment amount instead of exponentially decaying all the entries' counts. You then periodically (amortized O(1)) rescale all the entries' counts and the increment amount back down so the increment returns to 1.0, to avoid overflowing.

Their insight of using a mixture of a heap and a linked list to reduce the complexity of cache updates is interesting but of questionable value. For the best performing L values you need a threshold that puts everything in the heap anyway. It's worth noting that another way of expressing the "threshold" is that the top entry of the heap needs to have a score of <= 1.0 accesses in the past T time, and that the heap/list division point could be dynamically adjusted to take cache access behavior into account: just keep popping values off the heap into the list until the top of the heap has a score > 1.0. It might be possible to make operations approach O(1) on average because the top of the heap is typically where new references end up. However, even putting half the entries in the list only removes 1 iteration from the O(ln(N)) heap operations, so the extra complexity likely outweighs the performance wins.

They also didn't consider keeping metadata for entries beyond what is stored in the cache. This is something that ARC uses. I have a feeling that, particularly for large T values, keeping additional metadata is critical, since without it fetch history is thrown away every time an entry is expired. This can result in popular entries getting regularly fetched and flushed before their next fetch, so never accumulating enough fetch history to indicate they actually have high access counts in the last T time. Note you can probably just use a list instead of a heap for storing additional entry metadata beyond what is stored in the cache.

Introduction

The idea is similar to LFU cache expiry:

http://en.wikipedia.org/wiki/Least_Frequently_Used

Except you add an exponential decay to the access counter. At every request all the counters are decayed before the accessed entry's counter is incremented. This means the access counters represent the exponentially decaying average access count over the last T interval, where T is the timeconstant of the exponential decay used. This should prevent the scan cache-thrashing of LRU, while also preventing the page-burst cache sticking of LFU.

The decay interval could be measured in real time or cache accesses. It's a bit simpler and may work better to use cache accesses, but using time would work better if cache locality patterns tend to depend on real time more than the cache access rate.

Note that this is similar to http://en.wikipedia.org/wiki/Page_replacement_algorithm#Aging, but more generalized.

Implementation

An inverted priority queue can be used to keep the expiry order. Before any access all counters are decayed. On a hit, the page is pulled, incremented, and pushed back into the priority queue. On a miss the entry with the lowest count is pulled out and expired, and the newly loaded page is pushed in with a counter of 1.0. Note that decaying all the counters does not affect their order in the priority queue, since they all decay at the same rate.
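
As a rough sketch of that flow (a toy only: it assumes a user-supplied fetch() function for misses, decays every counter with an O(N) loop per access, and for brevity replaces the priority queue with an O(N) min() scan on a miss; the pqueue vs unsorted list tradeoff is discussed further below):

class DLFUCache:
  """Toy sketch of DLFU expiry with per-access decay."""

  def __init__(self, size, T):
    self.size, self.T = size, T
    self.data = {}    # key -> cached value
    self.count = {}   # key -> decaying access count

  def get(self, key, fetch):
    # Decay all counters before counting the new access.
    decay = self.T / (self.T + 1.0)
    for k in self.count:
      self.count[k] *= decay
    if key in self.data:
      # Hit: just bump the entry's count.
      self.count[key] += 1.0
    else:
      # Miss: expire the lowest count entry and load the new page at 1.0.
      if len(self.data) >= self.size:
        victim = min(self.count, key=self.count.get)
        del self.data[victim], self.count[victim]
      self.data[key] = fetch(key)
      self.count[key] = 1.0
    return self.data[key]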

Note that the decay operation is:

count = T/(T+dt) * count

Where dt is the interval (in accesses or real time) since the last decay was calculated. If you are measuring the interval in accesses and updating every access, then dt=1. You can also just halve the value every T interval.

To avoid decaying all the counters every access, it is possible to also keep a "time" with every access counter that records when it was last accessed, and the decay calculation can be deferred until that node is updated/compared. Note that the comparison function used in the priority queue needs to calculate the decayed count for comparing. This doubles the metadata size, makes the O(log(N)) pqueue insert/remove operations a bit more expensive, but removes the O(N) count decaying step.
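
A sketch of that deferred decay, assuming each entry keeps its count plus the "time" it was last decayed, and using the T/(T+dt) approximation from above (names are illustrative):

class Entry:
  """Cache entry with a lazily decayed access count."""

  def __init__(self, key, now):
    self.key = key
    self.count = 1.0   # decaying access count, valid as at self.atime
    self.atime = now   # "time" (real or access counter) it was last decayed

def decayed(entry, now, T):
  """Return the entry's count decayed to the current time."""
  dt = now - entry.atime
  return entry.count * T / (T + dt)

def touch(entry, now, T):
  """Decay the entry's count up to now, then count this access."""
  entry.count = decayed(entry, now, T) + 1.0
  entry.atime = now

# The pqueue comparison then orders entries by decayed(e, now, T) rather
# than by the raw stored count.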

Alternatively, the O(N) decaying does not need to be done at every access. Instead it can be done periodically, every fixed number of accesses or time interval. A periodic full decay does make the cache update overheads less predictable, which could be a problem for some applications (but then caches already have this problem). This also means you lose "time resolution" differences between accesses within the same "decay interval", so you want to keep it small relative to the cache churn rate. Alternatively this can be solved by exponentially growing the increment amount above 1.0 to compensate for time since the last decay cycle.

The counter could be implemented as a float, or a "fixed point" integer. Assuming the fixed point is after 8 bits, you would increment it by 1<<8=256 for each access, and decaying could be implemented by a right-shift every T interval. You need to be careful not to overflow the count, making sure the decay rate and updates are sufficient to keep it within the int range. Since each T interval you "shave off a bit" doing the shift-right, you need to keep T smaller than (maxint>>8)/2 accesses to ensure the count doesn't overflow when repeatedly hitting the same entry.
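
For example, a minimal fixed-point version with 8 fraction bits might look like this (a sketch only; the right-shift every T interval stands in for the exact decay):

FRAC = 8                      # fixed point bits: a count of 1.0 == 1 << FRAC

class FixedEntry:
  def __init__(self):
    self.count = 0            # fixed point decaying access count

  def hit(self):
    self.count += 1 << FRAC   # add 1.0 for this access

  def decay(self):
    self.count >>= 1          # halve the count, done once every T interval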

The priority queue doesn't need to be in the correct order until you need to expire an old entry. This means you can defer updating it until there is a cache miss. You could just O(N) heapify it on each cache miss, but if you are going to do that you might as well lose the priority queue and just O(N) find the min. Alternatively you could keep a small list of "dirty" entries that have updated counts but have not yet been adjusted in the priority queue, and adjust them all on a cache miss. This avoids the O(ln(N)) pqueue insert/delete operations for duplicate hits between cache misses.

Using an unsorted list vs a pqueue depends on the size of your cache and hitrate. A pqueue is 2*ln(N) comparisons per lookup to insert/move the pqueue entry regardless of the hitrate. An unordered list is (1-H)*N comparisons per lookup to find the min count (H is hitrate). This means an unsorted list is better if:

2*ln(N) > (1-H)N
1-2*ln(N)/N < H

So you need a very small cache and/or very high hitrate for the unordered list to be better. This suggests using a pqueue is the better option.
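
As a quick example of how high that break-even hitrate is:

import math

N = 2000                        # cache entries
H = 1 - 2 * math.log(N) / N     # break-even hitrate
print(H)                        # ~0.992, so the list only wins above ~99% hits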

Another option is a sorted list with a cursor pointing at the first entry with count=1.0. Cache misses O(1) pop the lowest count entry, and insert the new entry by searching down from (and adjusting) the cursor, which should be O(1). Cache hits move the entry down the list from its current position, hopefully not very far, which is worst case O(N) but maybe O(1) or O(ln(N)) on average.

Using deferred decay calculations vs doing a periodic decay every I interval depends on the size of your cache and how big I is. Assuming a pqueue is used, deferred decay calculations would be done at every compare, or 2*ln(N) times per access. Doing them every I interval means they are done on average N/I times per access. So deferred decay calculations are better if:

2*ln(N) < N/I
N/(2*ln(N)) > I

So deferred decay calculations are only worth it if your I is very small relative to your cache size N. Note deferred decay calculations also require additional storage for "atime" per entry.

Note that the decay calculation is relatively cheap, so depending on other metadata requirements and storage costs, it could be better to keep an "atime" and only decay the count at each hit, recalculating the decay O(ln(N)) times every time an entry is moved/inserted into the pqueue.

I suspect that the simplest efficient implementation would use an integer "fixed point" counter and a hit-increment exponentially growing from C=256 using integer fractions. The increment and all counters are decayed by halving them with a shift right when the increment reaches 512 (which will be slightly less than T accesses). At any time an entry's average access count in the last T is entry.count/C. To avoid overflowing we'd need a T < intMax/512. The integer fraction exponentially growing hit-increment looks like:

def __init__(self, T):
  self.T = T
  self.C = 256
  self.F = 0
  ...

def get_increment(self):
  """Exponentially grow and return the access count increment."""
  # This is equivalent to exponentially growing C *= (T+1)/T.
  self.F += self.C
  if self.F >= self.T:
    self.C += 1
    self.F -= self.T
  if self.C >= 512:
    self.decay_heap()
  return self.C

def decay_heap(self):
  """Decay all entries' counts and the access count increment."""
  for e in self.heap:
    e.count >>= 1
  self.C >>= 1
  self.F >>= 1

def increment_entry(self, entry):
  """Increment an entry's access count."""
  entry.count += self.get_increment()

Though also note that this is probably only optimal for a compiled implementation. For interpreted languages like Python it's probably more efficient to use floats and avoid the extra interpreted code:

def __init__(self, T):
  self.T = T
  self.C = 1.0
  ...

def get_increment(self):
  """Exponentially grow and return the access count increment."""
  self.C *= (self.T + 1.0) / self.T
  if self.C >= 2.0:
    self.decay_heap()
  return self.C

def decay_heap(self):
  """Decay all entries' counts and the access count increment."""
  for e in self.heap:
    e.count *= 0.5
  self.C *= 0.5

def increment_entry(self, entry):
  """Increment an entry's access count."""
  entry.count += self.get_increment()

Extensions

It's possible to keep count metadata for more pages than are kept in the cache so that access history for recently flushed pages is retained to "boost them" if they are re-fetched. The extra metadata would not need a heap, and could just be a list, provided it only includes entries with counts <=1.0.
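
A minimal sketch of that extra metadata, assuming it is just a mapping of recently flushed keys to their last decayed counts (names and sizing are illustrative):

class GhostCounts:
  """Remember decaying counts for keys recently flushed from the cache."""

  def __init__(self, maxsize):
    self.maxsize = maxsize
    self.counts = {}   # flushed key -> last decayed count

  def remember(self, key, count):
    # Called when an entry is expired from the cache.
    if len(self.counts) >= self.maxsize:
      # Drop the smallest remembered count to make room (an O(N) scan here,
      # but a list ordered by count would also do as noted above).
      del self.counts[min(self.counts, key=self.counts.get)]
    self.counts[key] = count

  def recall(self, key):
    # Called on a cache miss: return the remembered count (or 0.0) so the
    # re-fetched entry starts with its old history instead of from scratch.
    return self.counts.pop(key, 0.0)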

Depending on cache tuning it may be desirable that freshly accessed pages are not even initially cached (because a freshly incremented count is less than the smallest count in the cache), bypassing the cache entirely until their access history indicates they are worth caching. This would mean scans bypass the cache entirely. In this case you would end up with counts >=1.0 and any extra metadata would also need a heap.

Tuning

The T decay rate could be fixed, but it might make sense to have it dynamically adjust based on the performance of the cache. The overall performance of the cache can be calculated as the average of all the access counters. A number below 1.0 suggests very bad performance, with less than 1.0 access per cache entry on average over the last T interval. A number >1.0 suggests good cache hitrates. Note that the cache performance can be incrementally updated every cache access without having to scan all the cache entries: the average is decayed the same way as each counter; on a miss, decrement it by the flushed page's counter, then add 1.0 for the new page access.
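
A sketch of that incremental update, reading it as: track the sum of all the counters (decayed like any counter, plus 1.0 for every access, minus the flushed entry's count on a miss) and divide by the cache size for the average:

class CacheMetric:
  """Incrementally track the average of all the entry access counts."""

  def __init__(self, size, T):
    self.size, self.T = size, T
    self.total = 0.0   # sum of all the decaying access counts

  def on_access(self, hit, flushed_count=0.0):
    # Decay the total the same way every counter is decayed.
    self.total *= self.T / (self.T + 1.0)
    if not hit:
      # Remove the flushed entry's count before adding the new page.
      self.total -= flushed_count
    self.total += 1.0  # the accessed (or new) entry's count goes up by 1.0

  @property
  def average(self):
    # >1.0 suggests good hitrates, <1.0 suggests poor performance.
    return self.total / self.size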

Note that dynamically adjusting the T decay timeconstant messes a bit with deferred decay calculations, since very old pages in the cache are likely to have "decayed" at different rates. It's also possible that the order in the pqueue could be wrong. This means when you do adjust the T timeconstant, you need to first decay all the values using the old T, or re-heapify all the values using the new T. Note that re-heapifying means the T change is applied retroactively by varying amounts depending on the last update of each value, so decaying all the values is the better option.

There is probably a way to calculate a good T from the cache size.

Thoughts on Adaptive DLFU caching. --DonovanBaarda, Sun, 30 Apr 2023 10:47:35 +1000 reply

Trying to find a decent metric for dynamically adjusting T is hard. There are lots of performance metrics that indicate when the cache is performing badly, but they don't tell you if it's because T is too high or too low. Keeping more metadata per entry to give us long and short history would do it, but it adds storage and complexity. Another option is to use a different T for the cache and the extra metadata history to get a signal, but I'm not sure if that is possible or would give any meaningful signal.

Thoughts on adding counts for cache misses. --DonovanBaarda, Sun, 30 Apr 2023 11:05:10 +1000 reply

On a full cache and extra-metadata miss we currently assume the new entry's count was zero and increment it to one. However, this is incorrect and biases the cache against new entries. All we really know is the entry's count is less than the extra-metadata min count, which for large T could be quite large. We could assume its count was equal to the min-count before incrementing by one. This has the nice property that the avg count will tend towards T as the cache warms up, instead of it depending on the cache hit rate (due to counts "leaking out" with the flushed entry). Alternatives are assuming it was min/2, or some ratio of min assuming the distribution is exponential.

Thoughts on improving cache O(). --DonovanBaarda, Tue, 05 Sep 2023 11:36:11 +1000 reply

Using binary heap priority queues for the cqueue and mqueue means cache hits are O(1) and cache misses are O(ln(N)). For N lookups at a fixed hit rate this means the cache updates cost O(N*ln(N)). This can be done computationally cheaper by not trying to keep any sort of expiry order, but instead just keeping a "buffer" of free entries and periodically going through all the entries and purging any with a hit-count below a threshold to free up more space. The threshold can be learned/adjusted by something like a PID controller that targets keeping the cache 90% full. The purge could be triggered when the cache gets full, or in a background thread when the cache gets close to full. This makes both hits and misses O(1) with a periodic O(N) purge. After a purge down to 90% the cache would get full and need a new purge after N/10 cache misses. Amortising this over N lookups with a fixed hit rate would be (missrate*N/(N/10))*N = 10*missrate*N, which is O(N), or O(1) per lookup, which is much cheaper than maintaining binheaps at O(ln(N)) per lookup. The price is only averaging 95% utilization of memory. Risks are the threshold estimator could over/under purge and/or oscillate if not done right.
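
A rough sketch of that purge approach (illustrative only; the threshold controller here is a crude proportional adjustment standing in for a properly tuned PID controller):

class PurgeCache:
  """Sketch: no expiry order kept, just periodic purges below a threshold."""

  def __init__(self, size, target=0.9):
    self.size = size        # max number of entries
    self.target = target    # purge down towards this fraction full
    self.threshold = 1.0    # learned count threshold for purging
    self.data = {}          # key -> value
    self.count = {}         # key -> decaying access count

  def purge(self):
    # O(N) scan: drop every entry with a count below the threshold.
    victims = [k for k, c in self.count.items() if c < self.threshold]
    for k in victims:
      del self.data[k], self.count[k]
    # Crude proportional correction: raise the threshold if we purged too
    # little, lower it if we purged too much.
    kept = len(self.data) / self.size
    self.threshold *= 1.0 + (kept - self.target)

  def add(self, key, value):
    # O(1) miss handling, with an amortized O(N) purge when full.
    if len(self.data) >= self.size:
      self.purge()
    self.data[key] = value
    self.count[key] = 1.0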

quickselect for expiry --DonovanBaarda, Thu, 28 Sep 2023 13:04:14 +1000 reply

As an alternative to PID controller driven periodic evictions, QuickSelect can be used instead. This allows you to select the k lowest count entries in O(N). This means it's in the same order as the PID controller approach, giving amortised O(1) hit and miss lookups. The PID controller would still be quicker, as it uses a single scan through all the entries while QuickSelect averages 2 scans through the entries, but QuickSelect will always find the desired elements to evict and avoids the PID controller tuning problems/pains.
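
A sketch of the QuickSelect approach, using a simple in-place partition over (key, count) pairs to find the k lowest counts to evict (illustrative only, average O(N)):

import random

def k_smallest_keys(counts, k):
  """Return the keys of the k lowest-count entries, average O(N)."""
  items = list(counts.items())
  lo, hi = 0, len(items) - 1
  while lo < hi:
    # Lomuto partition around a randomly chosen pivot count.
    p = random.randint(lo, hi)
    items[p], items[hi] = items[hi], items[p]
    pivot = items[hi][1]
    store = lo
    for i in range(lo, hi):
      if items[i][1] < pivot:
        items[i], items[store] = items[store], items[i]
        store += 1
    items[store], items[hi] = items[hi], items[store]
    if store == k:
      break
    elif store < k:
      lo = store + 1
    else:
      hi = store - 1
  return [key for key, count in items[:k]]

# e.g. evict the 10% lowest-count entries when the cache fills up:
#   for key in k_smallest_keys(cache.count, len(cache.count) // 10):
#     del cache.data[key], cache.count[key]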



