The standard behavior of most cache system follows these steps:
- Attempt to read needed data from the cache
- If data is missing, compute it and place it in the cache
- Return the data
This is a fairly streamlined process that’s easy to add to almost any single algorithm that constructs data. The cache could be local (it’s part of the application, or even of the current function call), it could be dedicated (memcached), or the data might be persisted back to the database (such as adding the number of files in a given folder to the folder object itself instead of counting them every time).
The root of this strategy is the principle of memoization: if a function is pure — that is, calling it with the same arguments twice will return the sale result twice — then you can place such a cache in front of that function so that it will only be called once for every argument.
Memoization obviously found its way into RunOrg, because it’s literally a one-word optimization hint that trades memory for performance where it matters. In practice, in a web application like RunOrg, the only really costly computation is sending requests to the database, which is by definition not pure. Still, I can usually expect that for the duration of a single HTTP request, the database contents will remain reasonably stable, so I can create a temporary memoization cache when the request starts and drop it when the response is sent. Actually, I’m using a slight variation on standard memoization which is batch memoization: in order to improve performance, queries for objects A, B, C and D are represented as a single batch query to the database asking for the list of four objects. With standard memoization, if I asked for objects A, B and C, then for objects B, C and D, then six objects in total would be requested (because those two lists are different). By extending the memoization algorithm to know that list elements are independent, I can have the second query ask for only D, and retrieve values for B and C based on the first request.
Outside of this situation, however, functions can hardly be considered pure, so special steps must be taken to keep the cache up to date. This results in three common strategies.
The first one, which is the cache expiration strategy, is to give up on data freshness and decide that data in the cache will survive for a fixed duration regardless of whether the actual data has changed or not — so, instead of declaring that data is always up to date, it declares that any change will be visible everywhere in less than X seconds. While somewhat weak, this strategy is particularly effective because it does not need any kind of knowledge of the relationship between the cache and the underlying persistent data — the only connection between the two is the compute-if-missing steps outlined above.
Once you decide to handle the relationship seriously, two more strategies become available. The cache invalidation strategy is activated when the underlying data is changed, and invalidates all cache items that are dependent on that data. Thus, subsequent requests for those items will trigger a cache re-computation and always serve fresh data. Of course, this means that the cache system can easily tell which items should be invalidated. This is fairly easy in a one-to-one data-to-cache mapping, but as pieces of data can be mixed and matches into various cache items, this requires an unusually complex architecture to handle.
A nice design trick to keep in mind is that you don’t always need to find all that data — sometimes, you can simply «lose» it: for instance, web caching uses a combination of expiration and invalidation strategies. When a file is sent to a browser through HTTP, it sometimes carries a header explaining when it expires. This is useful when all the pages on your web site use the same CSS or JavaScript files, because then your visitors will only need to download them once and will use the cached versions from the on. To handle changes in CSS and JavaScript files, some web sites rely on cache expiration (a few minutes or hours of cache lifetime, so that any changes are detected soon enough) while others use cache invalidation. Obviously, the web site can’t go and notify every single browser that the CSS files have changed (especially since some browsers are closed or offline) so it will simply lose the data: while the CSS file named style.css?0001 will remain in cache for up to a month, the pages on the site are now asking for style.css?0002.
The third, the cache refresh strategy, is a variant on the former: instead of merely invalidating the cached data, this strategy computes the new data and places it directly into the cache. This is necessary when the data is frequently accessed and the computation is long: if one hundred users come asking for the data while it’s invalidated, then all of them will compute the data as part of the “if missing, compute” step of the caching process, which will probably bring the server to its knees — what people call a stampede — so the only safe thing to do here is to keep data in the cache at all times and replace it with a more recent value whenever necessary.
Another nice design trick is to use a flexible expiration date to turn a cache invalidation strategy into a cache refresh strategy: instead of invalidating the item by removing it from the cache, merely set its expiration date somewhere in the past. Then, to avoid stampedes, whenever an user detects that the cache is expired, they set the expiration date to sometime in the future and start computing the data. So, the first reader will notice a delay (as his request will reconstruct the cache), subsequent readers will instantly read the old cached version without triggering a stampede, until the first request ends and the cache contains the new version of the item. To choose between the two: if the event that triggers the refresh provides you with data that improves the time required to update the cache, then update it, otherwise merely invalidate it and rely on your stampede protection to do the job.
There’s one last situation that makes caching complicated, which I’ve recently had to handle with the RunOrg application — indexing. Suppose you have a huge amount of data that you need to wade through, nicely split into separate objects that each have their own responsibility (my profile, my membership information, my participation to event X, my answers to poll Y…) but you sometimes need to virtually aggregate all that data and traverse it to retrieve only certain parts of it: give me the name, premium-or-not-premium membership status and answer to “T-Shirt Size” in poll Y, sorted by the registration date to event X. Yes, that’s one of the many queries that RunOrg lets you do (and you can even print out that data to serve as a list of participants). Now, trust me, there’s no sane way to dynamically run queries of the sort on a clean normalized database and still get reasonable performance. So, you need to create an alternate, denormalized representation of all that data and keep it in cache to avoid re-computing it for every request.
The problem with such a cache is that you cannot afford to re-construct a thousand lines of cached data because one user changed their T-Shirt Size answer, so there can be no high-level validity check. Basically, if you can’t trust the cache to be up to date when you run your read query, you lose. The traditional “try to read, if invalid update” approach to caching goes straight out the window. You need a solid cache refresh implementation that pushes the most recent data into the cache as soon as it becomes available.
RunOrg uses a variant — the cache pull strategy. This is merely a small semantic shift, but it’s quite helpful: in the original cache refresh situation, the data model needs to be aware of the cache, because it must actively send data to the cache whenever a change happens. With the RunOrg variant, the data model merely publishes a “data was updated” event that the cache module may listen to and react by refreshing its contents. So, the knowledge of how to extract data from the model and place it in the cache now belongs to the cache module instead of being spread over both the data model and the cache module. This not only makes the code cleaner — the data model becomes cache-independent and thus easier to read through, with cache modules being tacked onto it using the event system — but it also lets the cache react to events from different data models: an item might be updated when the profile changes, or the membership information changes, or the event X participation changes, or the answer to poll Y changes… and will have to read data from all four to compute the new value anyway.
Obviously, the cache pull strategy is a more complex architecture than the previous one:
- You need an event system — the entire contraption hinges on the fact that a cache module can listen to the changes that happen in a data module.
- Your cache module must track the dependencies of each item, in order to update that item when it receives a change event for one of its dependencies
- You need asynchronous processing, as pulling values for dozens of items simply cannot be done as part of the standard HTTP response cycle
- You need to follow clean multi-process patterns to handle simultaneous updates of some items
Still, given the performance we achieve with this approach, and the clean code that results from the underlying events-and-async structure, the results are certainly worth the efforts.
RunOrg is my Start-Up ; we provide an online tool that helps associations, unions, organizations and communities manage their members, contacts, activities, events, knowledge and online presence.
Recent Comments