To keep things simple and manageable, the programmer is forced to assume in most of his code that execution happens sequentially, and that the global persistent state of a web application changes atomically and instantly when the script finishes executing.
This happy sunny world ends as soon as performance considerations appear. A 100-millisecond script execution, which is not quite unheard of, means that a fully sequential web site can only service 10 hits per second on average. If this is all that is required, then this is fine. However, popular websites might have to service more than 20 million hits per day, which amounts to 231 hits per second or about 4 milliseconds per request.
Needless to say, even serving static pages does not bring you close enough to that 4 milliseconds. Wikipedia, with its 187 million hits per day on the English version alone, would have to accept an incoming connection, fetch the appropriate data from the database and respond to the request within 460 microseconds.
When you reach this level of speed requirements, single-machine optimization reaches its limits—not even assembly-level driver optimization by hardware experts could bring out that kind of performance. The only way to improve performance sensibly is to add more hardware. If you have to service 20 million daily hits but can only handle 100-millisecond responses, you need about 23 identical servers (and unless your business model is insane, those million hits are bound to pay for 23 $300/year servers quite easily). In fact, you could even decide that 23 processors (in six quad-core servers) are better.
That, of course, would be in an ideal world where those 23 independent servers can churn out the correct responses full time. However, the original issue was not a hardware issue, but a design issue: the source code assumes that all requests occur one after another. Adding more hardware to handle several requests at once breaks that assumption and leads to unexpected behavior in the program.
The New Bottleneck
So, in order to keep the program well-behaved, the actual persistent state is still stored and updated in a single place, usually referred to as the master database server. The old bottleneck of accepting connections, and responding with correctly crafted documents is gone, but a new bottleneck appears: there is still only one server to handle all requests related to the persistent store.
The issue of delegating persistent storage to several servers is a complex one. The easiest part is delegating read access, but keeping write access on a single server: this creates slave database servers, which are updated with the data from the master server and allow read-only access to lighten the load of read requests on the master. This is not trivial to do, however: first, cross-server transactions do not exist (you cannot read from a slave server then use that value to update the master server, or at least not atomically) so every potential mutator request must either work with the master server alone or accept losing atomicity, and both solutions require changes in code to redirect read requests to appropriate locations or make non-atomic modifications acceptable.
Quick note: it is not necessary to use the same kind of server for the master and for the slave. For instance, you could use a normal database for the master, and a caching system (such as memcache) for the slaves. This way, you get the optimization benefits of a database for concurrent modification handling (in mutators) and the optimization benefits of a cache system for read handling (in accessors).
Then, there’s the question of delegating write access. Sometimes, it’s easy: if the data you need is made of large independent blocks, such that the time of finding them is smaller than the time of altering them (for instance, YouTube videos) you can dispatch them to several servers and use a single (possibly distributed) directory to find them. Or, if you have several largely independent sets of data (such as independent forums, or the “Posted Items” and “Profile Information” on Facebook) that never need to be modified or queried together, you can dispatch them to several servers. Or, again, if you have a simple way of splitting entities based on their properties (for instance, based on whether the identifier is even or odd) you can route your modification requests based on that property.
Doing It
Accessor/Mutator separation : the most fundamental analysis that has to be done is to determine which requests are mutators, and which are accessors. Typical web sites run many more accessors than they run mutators. If you have a static analysis tool that can determine whether a certain code path or controller ever performs a modification, you’re lucky. Otherwise, you can either write your code to handle this at some granularity level, such as functions (my personal convention is to prefix all mutator functions by “do”, for instance doLogin($username,$password) might change the database), modules or URLs (a classic way of doing this is to login as a different database user when inside an accessor URL, one without the rights to perform modification queries); or you can rely on empirical analysis by logging all queries performed by every URL access over a certain period of time. If a certain URL visit frequently results in a mutator query (or has recently done so), then run it as a mutator, and otherwise run it as an accessor (in the odd case where an URL runs as an accessor but attempts a mutator query, re-run it from the very beginning as a mutator).
Immediate/Delayed separation : the next step is to determine how soon a certain modification should be visible to readers. The user is less capricious than the code: a mutator has to see the current state of the application in order to work correctly, but users can usually handle a state that is a few seconds or minutes old without problems. This allows the master server to delay the propagation of some changes for a short while, which can be pretty handy if it’s under a heavy load. These changes will still be available to other mutators, of course, but the readers will see an older version. Some modifications, however, have to become visible instantly: changing one’s password, for instance, should enter into effect immediately, while indexing a new item within a search engine may take a little bit longer to happen before the user notices. The benefits of a query-based cache, such as MemCache, are obvious here: instead of pushing the data from the master server to the slave server, you can have the web server itself invalidate elements of the cache related to the newly updated piece of data without adding any load to the master server unless the data is actually accessed (or the web server could fill the cache with the appropriate value).
Transaction separation : when a certain process has to perform writes, not all reads that occur before those writes have to be part of a transaction. For instance, if visiting a page updates a “this user is currently visiting…” field in the table, then that update is independent of any reads that happen when serving that request (so, for instance, it would make no sense to read the visited page from anything but a slave server or cache). A typical means of separating transactions is to wrap them in functions that perform all transaction-related work by themselves, either as procedures stored on the database or as normal functions in the host language.
Memoization Trees
A given web page (or any HTTP response, for that matter) is composed of three distinct yet interleaved sets of data:
- Constant data (views, internationalization libraries), which only changes during major events such as software updates.
- Program state data (databases, filesystem), which changes when the software alters it.
- External data (RSS feeds, FTP, the current time), which can and will change randomly.
An ideal cache system stores data in such a manner that, when a request hits the cache, only minimal work is required to combine the cached data with external data and obtain the correct response. The ideal piece of software is therefore:
page(args): state_data = get_state_data(args); external_data = get_external_data(args); return frobnicate(state_data,external_data);
At which point, caching mostly amounts to memoization of the state data.
This brings up two problems. The first is that the code seldom looks like the above: usually, the external data is intermingled with normal data at several points in the code, which makes a full caching impossible. But that’s not really a problem—one could insert a caching layer between the database and the program, and just trigger caching based on the initial arguments. The real problem is that this kind of caching is not granular enough, because it caches the entire state for a given set of arguments. If the arguments change by even a single bit, the cache becomes invalid, even though 95% of the entire cached data was still valid for the new set of arguments.
Another way to look at the final document, is to see that it was generated from a tree of operations that start with the three types of data as the leaves, have various pieces of code as the branches, and yield the final page as the root. Displaying a page with different arguments results in a forest of such trees, except that some branches may be shared between the two trees (which branches are shared depends on which arguments were changed).
So, if every branch that depends only on non-external data stores both its result and the arguments which led to this result, displaying a tree will not have to run all the branches that it shares with a previously evaluated tree. Caching policies may then eliminate branches that are examined too infrequently while pushing forward those branches that are evaluated often. Statistical analysis over a long duration might even determine that certain branches deserve to be cached while others don’t, and lead to introduction of caching code at appropriate points.
The net result that, instead of having a single cache layer, there are several layers of varying available size and flush frequencies. In extreme situations where the memory available for caching is infinite, every single part of the program is cached (and is therefore computed only once).
In practice, dividing the view/controller section of a web site into a hierarchy of blocks helps enforce this: cached blocks have their HTML pulled from the cache and displayed, while uncached blocks compute that HTML and add it to the cache. HTML which appears on several pages is cached only once and then reused on all other pages.
Invalidation Cascades
The problem with having several layers of caching is that it makes cache invalidation harder. Relying on classic timeout means that every update takes several timeout lengths to traverse the entire set of layers, which leads to smarter invalidation techniques. The first question to be asked is whether the invalidation should propagate when queries happen, when modifications happen, or as a side operation.
My suggestion is, for every node in the memoization tree (which is actually a directed acyclic graph made up of the merged tree of a forest, welded along their shared branches), to store a list of direct super-nodes that use it. When the data directly below this node disappears or changes, make the node disappear and propagate this disappearance to all its super-nodes based on the list. This can be done instantly (so that all nodes that depended on the changed value disappear) or performed asynchronously by an invalidation process that keeps a list of “to be invalidated” nodes, depending on the needs.
The alternative is to perform the test at run-time, by storing for every node a list of state-level triggers (such as “property Foo of object ID=bar was changed”, possibly with wildcards) that can result in invalidation. This list of triggers can be constructed at memoization time or (even better) at compile time. The memoized value then keeps a set of timestamps corresponding to each trigger, and compares those timestamps with the latest execution of those triggers. The update process then merely traverses the list of triggers and determines which ones should be refreshed based on what the update is really doing.
Hi. I'm Victor Nicollet,
0 Responses to “Perfect Caching”