Tag Archive for 'Architecture'

Get Rid of Constraints

We add constraints (UNIQUE, FOREIGN KEY…) to our databases in order to achieve a tradeoff between how easy it is to write data to the database, and how easy it is to read data back. Constraints make writes harder because the application must ensure that those constraints are verified, and react appropriately when they are not. They make reads easier because they allow the application to make assumptions about the data it receives.

Sometimes, the benefits to data reading are not sufficient to offset the increased difficulty of writing it. Let me provide an example.

In a project management tool, users on the free plan are only allowed to have one project. Premium users can have any number of projects. An obvious solution would be to constrain the database so that a given  user can only have one free project at once. This would guarantee that any attempt to connect a second free project to an user would fail.

However, the possibility of non-technical failure of “connect a project to an user” will have to be handled in a variety of unusual locations:

  • Deleting project A, creating project B, and cancelling the deletion of project A would cause an error.
  • Merging two free user accounts would be interrupted because both already have a free project.
  • Downgrading from premium to free would break if the premium account had more than one project.

The good news is that your database will stop you from doing anything that would break the constraint, so you do not have to worry about a glitch letting free users have two projects. The bad news is that you now need to handle that failure specifically in three different locations in your interface: cancelling a deletion, merging two accounts, and an account downgrade that might happen when the user isn’t even connected. It’s possible to handle all three, but it’s quite repetitive, and the user interface for handling these is probably going to be different in each case.

A Posteriori Constraints

A better solution would be to allow users to have any number of projects in the database, but detect free users that have more than one project (because of cancelling a deletion, merging two accounts, or not renewing their premium subscription) and force them to take corrective action through the user interface. For instance, users with inconsistent accounts are greeted with a list of their projects and the possibility to keep one of them unlocked. The other projects are then locked and can only be unlocked by deleting the currently unlocked project or paying for a premium subscription again. You can even provide some leeway by letting them keep full access to all projects for a week before forcing locks on them, or let customer support temporarily unlock projects in cases of dire need.

The beautiful aspect of this is that the code required to solve the inconsistency is not repeated. Regardless of how you end up with more than one free project, the feature that corrects this is always the same. This means the application is smaller, which leads to fewer bugs and a more consistent user interface.

Of course, if non-subscribing users already have a project and try to create another, they are told to subscribe. Lack of constraints in the database does not mean there are no constraints or clever up-sells in the interface. While RunOrg does prevent users from exceeding the storage ceiling by uploading new documents, all other ways of breaking past the ceiling (such as un-deleting documents or lowering the ceiling) remain unaffected, and our administrator team only takes corrective action when there is a clear abuse of the software — you can stay a few megabytes above the ceiling for as long as you wish. As for the limit on how many users can join a community, there are even fewer restrictions — it would be quite harsh to prevent the 2011-2012 members from joining only because the 2010-2011 members have not left yet — so we ask our customers to return below those ceilings as fast as possible, and only do it ourselves if there is abuse.

There is a conceptual shift here from No Inconsistent Data In The Database to Detect And Correct Inconsistent Data. This step can be a bit difficult to stomach, especially if you are used to letting your database keep your data clean for you, but for many constraints the shift is actually worth it.

In the early design phase, we moved RunOrg from a standard MySQL architecture over to CouchDB, a NoSQL solution that provides absolutely no constraints beyond having a unique primary key. There are currently no foreign key constraints in the system, nor are there any unique constraints beyond enforcing many-to-many relationships (and these are fundamentally primary keys). This seemed like a huge challenge at first, but the system has grown to support complex data structures and business rules that go beyond standard document-based CRUD, simply by accepting that pieces of data can be missing or be duplicated, and applying an a posteriori solution, possibly involving user intervention, has worked for us so far.

Of course, what is true for document databases is not necessarily true in the SQL world. While we impose no constraints on the relationships between documents, we are pretty strict about the structure of documents themselves (especially since they are being read from OCaml, a strict static language), and there is no simple way of representing “Foo holds a list of Bars” in SQL without a foreign key (in CouchDB, it’s standard JSON…)

This is the important information here: those are structural constraints, they define a data structure in the database that must be respected by your implementation, as opposed to non-structural constraints that merely define relationships between higher-level entities and must be respected by your users. Identify the latter, and consider whether you can shift them to a posteriori constraints instead.

Article Image © Calsidyrose — Flickr

Dealing With Huge Projects

Right now, I’m the only developer working on RunOrg, which happens to be a 45k-line project written in OCaml. According to a common terseness observation, this is equivalent to managing a 135k-line project in Java. Alone.

OCaml shares a problem with many dynamic languages : it’s very expressive, but there is no general consensus on what architectural best practices should be, so there are literally dozens of different ways a given feature might be implemented that cannot be discriminated on anything but taste. This leads to a variety of unique design choices throughout the application which, despite working well with each other, cause programmers to «discover» new architectures every time.

In the end, I believe that the philosophy of using the best tool for every job can easily be taken to painful extremes if you are not careful. You encounter a new problem, pick an unusual but well-adapted solution, and it makes perfect sense to you, so you move on. Months later, you come back and the solution does not make sense anymore because you have forgotten a small detail about how it works or why it was done this way, and you have to hunt that small detail down by reading the code. I’ve pretty much solved this anti-pattern, so I’ll come back to it later.

The main point I’m making here is that for every project, there is an ideal mudball of code that happens to perfectly implement everything without bugs, all in a single gigantic file, and you cannot write this mudball. For a human, there’s no way to manage anything mudballish past a few hundred lines because you cannot wrap your mortal mind around the possibility that every line might interact with any other line in the project… so, as an architect, you slice up the mudball into more acceptable bits that you politely call «modules» in order to reduce the number of things any given line might interact with. We reduce the amount of data we need to cope by adding big «you don’t need to think about this» signs everywhere (and making sure the signs don’t lie, obviously).

And on a small scale, this works, because you only have a dozen modules and it’s enough to fit in your short-term memory. RunOrg currently has 260 high-level modules, and several times that amount in sub-modules. No UML design, no matter how comprehensive, can make all those modules fit in my mind at once. I must find some «you don’t need to think about this» signs before I can move on.

There are mostly two ways of slicing a given project into modules: horizontal and vertical.

Vertical slices happen when there are dependencies, and modules look like layers stacked on top of each other with each layer being allowed to access the layers below. The RunOrg project architecture actually starts with a clean set of vertical slices : the controller layer deals with HTTP actions by using the view layer and the model layer below it, but the view layer cannot access the controller layer, and the model layer cannot access either.

Horizontal slices happen when there are absolutely no dependencies, and modules look like books cleanly arranged next to each other on a shelf. This usually happens when those modules represent the same concept for different purposes. In the RunOrg project, the controller layer is divided into many action modules, with each of these modules handling the HTTP requests for a limited part of the application. For instance, there’s a Login module in charge of handling HTTP requests related to logging in, and a File module in charge of handling HTTP requests related to uploading files. The concept is the same (handle HTTP requests) but the purpose is different (logging in, uploading files). And there is no need for either module to know about the existence of the other.

Knowing whether slices are vertical or horizontal immediately tells the programmer about what dependencies should be considered for that slice. And it is all recursive : the Login module of the controller layer is further divided into a Login_common bottom layer for common definitions, the root Login top layer for binding everything together, and an intermediary layer of horizontal Login_form, Login_signup, Login_lost slices dedicated to the various independent aspects of logging in. The naming convention helps identify the pattern used.

In practice, the slices do not necessarily map to actual namespaces or modules because, especially at very low levels, the granularity involved to segregate the two would be too verbose. For instance, while it may appear that the controller layer is made up of modules that are all horizontal slices, this is not the case : while the actions (functions that respond to HTTP requests) are indeed independent horizontal slices, the layer also contains helpers (functions that provide common functionality to actions) that follow a vertical layering, and a given module will usually contain both actions and helpers indiscriminately.

What is relevant here is that the patterns used will let you determine easily what kind of slice you are dealing with. And a pattern is a named convention (action, action helper, view template, table) that is respected by relevant pieces of code, in terms of :

  • Location : where is it within the module and file hierarchy, and in relation to other constructs within the same module ?
  • Structure : how does the code look like ? What parts of the pattern are expected to be changed and what parts should always be the same ?
  • Type : what is the signature of the module, class or function defined by the code ?
  • Name : is there a common suffix or a way to give a name to entities following the pattern ?

These are guidelines, a pattern should usually have at least one of these, and the more the better, but you don’t have to implement all four if it is counter-productive to do so. Also, a pattern should define a dependency rule : it is generally understood that two pieces of code that follow the same pattern have a dependency dictated by that pattern, and that dependency is usually a horizontal slice.

The important thing about patterns is that they are not an external influence on your project. If you limit yourself only to those patterns that are dictated by the Gang of Four book, or by the framework you are using, then you will miss out on the many patterns that will emerge naturally within your application. Quite to the contrary, it is essential to identify as often as possible the patterns that appear in your code, clean them up by providing both a name and conventions of location/structure/type/name, and apply them wherever necessary. This will make your code more easily recognized by the programmer, because there are only a handful of fairly generic concepts to learn (the patterns) and everything else can be understood by finding out what patterns are used. Even better, familiarity with patterns places a  «you don’t need to think about this» sign on the parts of the pattern structure that stay the same, because they never change.

And now, I have cleverly returned to my previous point : the inevitable conflict between the use the best tool rule and the use the same pattern rule.

Using the best tool creates the risk of writing code that is very difficult to understand later on, because there are too many special cases. Using the same pattern everywhere causes problems when the pattern is ill suited to the problem being solved, such that it creates code that is too long, too repetitive, or too unsafe.

In my day-to-day routine, I follow the use the same pattern rule until it becomes too painful. Then, I just change the pattern to make it less painful to use, and propagate the changes to all the places where it is used, which in turn was made possible by the fact that I did use the same pattern everywhere.

Obviously, I don’t have a pattern for everything. So, whenever I encounter a problem for the first time, I go with the best tool instead. Once that kind of problem is solved several times, a pattern will emerge and some refactoring will happen.

Article image © holycalamity — Flickr

Rewrite Your Code

Writing code relies on four kinds of decisions:

  • What algorithm can implement this feature?
  • How is that algorithm best written in that specific language?
  • What platform quirks and subtle edge cases must be accounted for?
  • How does this code fit in with the rest of the application?

Regardless of team experience or preliminary analysis, some of these decisions will be incorrect. Maybe the algorithm failed to take into account the unusual distribution of real-world data ; maybe there was a better way to write it ; maybe there’s a subtle bug that will not be discovered for weeks ; maybe a possible code reuse has not been identified during the design phase… or maybe the customer requirements that the feature was based on were not actually adapted to the customer needs.

Such bad decisions get in the way of users, but they also hinder developers, who have to regularly work around existing bad decisions, which in turn causes more bad decisions to be made in recurrent “lesser of two evils” situations.

It is a good idea to go back on your bad decisions and make new ones instead. They will not necessarily be good, but at least they will address some of the problems with the old ones.

Don’t try to go back on everything at once. Most of the time, the shortcomings of a decision can be identified in hindsight, change too many things at once and hindsight will be lost. In particular, throwing away non-trivial portions of code (anything beyond a single function) in order to rewrite it from scratch is quite risky, especially since it might also discard good decisions that would be hard to retrieve.

Don’t make your code difficult to change. Going back on your decisions will involve rewriting code. Lots of it. So far, most of the code in the RunOrg project has been rewritten at least three times. Make sure your language, frameworks, libraries and unit tests all work together to make it easy to evolve specific parts of your code to change decisions. The worst situation for a project to be in is code freeze — changing code is forbidden because it’s too risky and it might break something. If you suspect that your project might be heading that way, immediately drop everything you are doing and bring your project back to an acceptable state ; if you are not allowed to do so, make sure you send out a warning to anyone who might need to know.

Don’t make too many decisions. This is usually spelled out as YAGNI : You Ain’t Gonna Need It. If there is currently no need for a given feature, other than the fact that it should remain possible in the future, then don’t implement it. Implementing it will involve making many decisions about how it should happen, and lack of practical application will increase the odds that those decisions are wrong.

Don’t be afraid to go back on huge decisions. Weeks ago, an initial decision we made on the RunOrg project turned out to have huge performance implications. I was faced with two choices : keep that decision, and manually optimize the locations where the performance suffered the most (this involved manually handling caching and batches) ; or go back on that decision, re-architecture the entire database access system and propagate those changes throughout literally half the project, in order to allow automatic caching and batch construction in ways that manual optimization could never allow. The rewrite took me four days, with some aftershocks being felt several days afterwards (strangely enough, changing 20k lines of code resulted in only four fairly obvious bugs).

What does your decision-making process or pipeline look like? What does your decision postmortem and reversal process look like? How often do you go back on your decisions?

Article image © Barb Crawford – Flickr

On NoSQL Design Tips

Cloud computing expert Joannes Vermorel posted a few design tips for your NoSQL app two weeks ago. As a short anecdote, Joannes is the man who (through a rather unintended and surprising chain of events) prompted us to move from SQL to CouchDB six months ago on the RunOrg project. So, here are my very own thoughts overlayed on top of Joannes’ five-point article :

You need an O/C (Object-to-Cloud) mapper : From my experience, almost all read-only code in the application involves some form of object graph traversal : select a root object, extract a list of related objects, filter the list based on a predicate… Most of the traditional Object-Relational impedance mismatch is a consequence of SQL’s insistence on performing most of that traversal on the database server using joins in a conceptually parallel environment (every non-discarded row is going to be treated the same by the join) where object-oriented strategies dictate that each object should be able to react differently based on its type. So, SQL-and-Objects leaves you with a sprinkle of duplicate traversal logic across tiers, and some low-performance application-layer graph traversal where polymorphism requires it. And by low-performance, I mean two hundred one-item requests in a row :

foreach (Document document in documents)
  owners.add(document.getOwner());

And there’s no simple way of dealing with this, short of dictating that documents don’t have owners, they have owner identifiers that can then be used to retrieve the actual owners from the database (which may be the case for persistent documents, but is a leaky abstraction that prevents the rest of your application from creating temporary in-memory documents). With a few exceptions, in NoSQL, there are no joins. On the one hand, this reduces the expressiveness of the data storage layer, but this also has the ironic consequence of making NoSQL more optimal for application-driven graph traversal. Adding insult to injury is that most key-value stores implement bulk queries which, when appropriately leveraged by the mapper, transparently improve performance. Breathe (the CouchDB mapper we use) does this :

let owners = documents |> Breathe.batch (fun doc -> doc # get_owner)

Since Breathe is purely functional and based on monads, any requests performed by the many doc # get_owner calls can be merged into a single CouchDB request using the bulk API.

Another reason why mappers are necessary is that unlike SQL, which works on an extremely simple “send query, get results” basis, most NoSQL solutions involve complex idioms using basic primitives. For instance, when dealing with CouchDB, you cannot press the “use transactions” button and expect it to work. CouchDB transactions involve posting a document, receiving a collision notification, downloading the new version of the document, applying the transformation again, posting the document again, and so on until you either run out of retries or no collision happens. And collision detection involves keeping a revision number around and passing it to the API, so your code also has to deal with whether there’s a cached in-memory copy of the object that you can use to grab the revision or if you will have to actually query the database. The public API provides nothing to relieve this pain, so it is up to the mapping system to do this :

let publish id =
  let publish article = { article with published = true } in
  Article.DB.transaction id (Article.DB.update publish)

The mapper will grab the initial page value from the database if it was not already in the cache, try the update 20 times if collisions happen, and place the latest version of the page in the cache. Using raw API calls, this would take ten lines and probably contain many bugs. And did I mention that the function above can use the bulk transaction API if placed in a loop ?

Performance is obtained mostly by design : which I would rephrase as “your O/C mapper abstracts operations, not architecture”. In short, there’s no way for your application to shield itself from the architectural consequences of NoSQL storage. For instance, CouchDB just doesn’t do multi-object transactions, so this is something that the application will have to take into account — this means that most operations are idempotent so they can be safely retried, and most cross-object invariants are expected to be eventually consistent.

And, indeed, you cannot write an application and then leave it to the database administrators to create an index wherever necessary. The good news is that most NoSQL solutions, by their simple design, make it hard to do any inefficient operations (it takes one line to do a full table scan in SQL, but it’s a lot harder to do with NoSQL, and a lot more obvious when it happens — why am I querying _all_docs again?)

The 20 updates/second limit should be taken with a grain of salt. NoSQL performance is certainly predictable, but sweeping generalizations never apply and there are many bottlenecks that you can hit in any particular order — application-database bandwidth, database-disk bandwidth, database processor usage… right now, CouchDB handles 130 updates/second (an update is a GET-PUT cycle), with the bottleneck being that the updates are synchronous on the application side, so it doesn’t send as many requests as the database could possibly handle (if you run two application instances, you can get 260 updates/second). This is a fairly healthy bottleneck to have, because it means adding more application instances solves it, and this just happens to be our scaling strategy ;-)

Go for a contract-based serializer : automatic serialization is great for one-shot messages that run between instances of the application, but a pain as soon as you need to persist them and, consequently, have the storage format evolve along with the application.

I have no opinion on the XML-vs-JSON debate, though I use JSON myself because it fits with both the database and client layers of my project.

In general, I find the use of NoSQL to be no excuse for making a mess in the database. Sure, you’re not constrained by a schema anymore, but data in the database is what you are going to manipulate ! If there isn’t a clean description of what you can find in the database, that data might as well be lost, because you will not be using it. Using application-wide conventions on data storage and representation helps developer A manipulate data generated by developer B without requiring a meeting to get everyone on the same page. Storing data in an opaque blob representation is fine for short-lived situations, obscure and rare edge cases (such as when the Lokad.Cloud queue offloads large objects into blob storage because it has to put them somewhere), but most of your data should have a clear documented schema. Your database does not need it, but your team does.

Entity isolation is easiest path to versioning : You Ain’t Gonna Need I (yes, I disagree with Joannes on this one). Unless something went terribly wrong with your code base, it’s  to duplicate your entity class when the two versioning paths actually start to diverge (and even then, you can certainly keep some common bits factored out nicely) instead of doing so from the very beginning.

With proper design, aka CQRS, needs for SQL drop to near-zero : Well, let’s face it: SQL provides a toolset that goes beyond anything any NoSQL solution in terms of general flexibility and expressiveness (including the ability to simulate key-value stores using blobs,should the need arise). NoSQL solutions remain niche solutions to niche problems because this is the way they are designed, possibly out of a fear of inadequacy and the general message “NoSQL should be used where SQL is not good enough” — most NoSQL users still use an SQL database as the primary storage, and dump data into NoSQL for a small subset of queries that need the performance boost. Companies like Lokad or RunOrg, that embrace NoSQL-only architectures, are still rare and still struggling to understand how, exactly, some things should be done in NoSQL that would take one line of SQL code. You don’t see Design Tips for SQL articles being written in 2011 ;-)

Even with CQRS, requirement changes have a much greater impact on NoSQL application architecture than they would with SQL, because SQL mappers despite all their flaws still manage to provide a significant amount of shielding. What used to be a matter of adding a few tables and doing the right joins now turns into an uphill battle of propagating the appropriate de-normalized data into the right places to allow for an index to be built (which, by the way, leads me to believe that Aspect-Oriented Programming might be more adapted for NoSQL than standard OOP) that simply isn’t possible without a nearly extremist agile culture. Sure, with CQRS and event sourcing, this isn’t a losing uphill battle we’re fighting anymore, but it’s still uphill. I’m pretty sure the reason why Joannes does not see things this way is that Lokad has an extremist agile culture.

Object-Oriented For the Win.

Readers, beware: this is an opinionated piece. Feel free to curse appropriately in the comments below, should you find the need to do so. I’m writing this because of yet another encounter with an object-oriented zombie, and one without the excuse of being fresh out of school. They all stand united behind the motto that Object-oriented programming makes your code easy to reuse, debug, maintain and extend and I happen to disagree.

Something interesting happened in the Pacific Ocean after World War II: during the war, American soldiers had set up air strips on Pacific islands, complete with little control towers that had radio stations. And the natives observed that those soldiers would speak into a weird piece of metal, and a giant iron bird would land from the sky and deliver food and supplies. Then, the war ended, the soldiers went home, and the natives tried to bring the giant iron birds back by building shacks that looked like legitimate control towers, and spoke strange words into rocks that looked like legitimate microphones, and it worked!

No, it didn’t. Cargo planes did not come back because it was not the shape of the microphone and antenna that brought in the planes, it was their invisible electromagnetic properties. These are the cargo cults: imitating the outside appearance of things that work and expecting the imitation to work as well. The same happens with object-oriented programming — the use of classes, or inheritance, or design patterns is not the reason why good code is good: they are merely tools which happened to be applied by knowledgeable and skilled programmers in order to avoid problems that they knew would happen if they did not preempt their appearance. To look at their code, find out that they used classes, and deduce that classes were responsible for their success is about as rational as expecting to steal the strength of your defeated foe when you eat his flesh. Object-oriented programming can only solve design problems you know you need to solve.

Imagine a piece of old, spaghetti C code written in procedural style, with functions calling each other and accessing global variables all over the place. I suspect many young graduates these days never had to deal with these — and I wouldn’t have either, were it not for my unrequited love for mental anguish and programmer pain. Back to the point: in your mind, draw an arrow from A to B if function B calls function A or accesses global variable A. This is a dependency graph: if you were to change the run-time behavior of A, then the behavior of B would almost certainly change as well. By following all outgoing arrows from the point you altered, you can find every element that could possibly be affected by it.

No programming style or design methodology on earth is going to change this ; whether object-oriented or aspect-oriented or functional, functions call other functions, and changing one part will have an impact on many other parts. That is something we accept, because propagating changes is part of the programming job, and we have even elevated refactoring — the propagation of the absence of changes — to the level of a good thing since sliced bread.

And since we have to track down all the dependencies of a given entity sooner or later, our national sport is to make it as easy as possible. One way is to let the compiler or build process help — compilers detect type mismatches that are usually a hint of non-propagated major behavior changes, while automated unit tests and regression tests handle more subtle changes, and both of these are fairly independent of your programming style. Another way is to willingly reduce the number of dependencies through architectural constraints — Model-View-Controller prevents a model from being dependent on a controller, because altering a controller should not change the behavior of a model. These are fairly common ways to reduce static dependencies.

Then, there are the dynamic dependencies — functions which alter the behavior of other functions at run-time. Remember those global variables from the C program? If function A writes to the global variable and function B reads from the global variable, then calling A will probably change the behavior of B. Runtime dependencies are fairly obvious to map: every time a function writes to a global variable, draw an arrow from the function to the global variable and mark the function as having a side-effect. Then recursively do the same for every function that calls it, and so on until you run out of functions. Your dependency graph just became a lot hairier than it was before.

Such dependencies are the reason why working on the report-printing feature in your invoicing software somehow managed to break the database storage : even though both modules are independent and the static dependency tree for report-printing does not flow into the static dependency tree for database storage, there are a handful of global variables that allow your changes to follow dynamic dependencies and leave a flag unset when it should have been set and ultimately blow up the nuclear silo.

So, when you have global mutable state, then changes can jump from module to module through those pieces of global state. Procedural programming mostly dealt with this by cleanly wrapping up any global state in a module, and having modules interact with each other by means of interfaces that managed to guarantee some invariants on the global state. But still, some problems remained. I am reminded here of the code to Dungeon Crawl, an open-source game written in procedural style, where the module responsible for dealing damage to the player will make frequent calls to the module responsible for outputting messages about the damage being dealt (“The hobgoblin hits you!”). As written, you cannot use the damage-dealing code without displaying messages, so writing an AI module that tries to predict the outcome of an attack is harder than it seems, because it cannot use the existing damage-dealing code to run predictions.

They actually have sludge elves.

Functional programming solved the global mutable state issue by eliminating global mutable state altogether. In the aforementioned example, the damage-dealing module would return the new damaged player along with a list of messages that were generated, and these messages could then be discarded silently in the AI subroutine, or forwarded to the screen in the main routine. I am a big fan of functional programming, but let’s face it: not only is it harder on the brain than good old set-variable-values-everywhere programming, but it does not suit well to situations which have an implicit reliance on global mutable state — such as web servers connected to a persistent database. I look forward to a pure functional friendly database, but it’s not there yet.

Both of these solutions rely on encapsulation: whatever happens to be the implementation of a procedural or pure functional module is hidden, and only available through those functions. This creates a dependency bottleneck, so any changes performed within the implementation can only leak out through the interface in tightly controlled ways — at compile-time, those are called contracts, at run-time they are called invariants.

Object-oriented programming works by deprecating global mutable state. It’s not completely gone, and some programmers yearning for the good old days still try to enjoy some global goodness by using the Gang of Four Singleton pattern, but it’s mostly deprecated. The sickle and hammer of object-oriented programming are dependency injection and late binding, and it uses them to break down the structure of static dependency graph bourgeoisie.

Yes, object-oriented programming is all about building the dependency graph at run-time. Then again, a program that generates machine code at run-time does the same. Object-oriented programming merely provides a set of tools that are easier on the human brain than the extremes of generating machine code at run-time. As an aside, keep in mind that it is not the only toolbox available to you: any language with closures can do the same, and although closures are both more elegant and slightly more complex to handle in the general case, they have found their way into event-based programming because an event handler is just, well, a closure. This is why jQuery passes functions around everywhere instead of requiring the user to implement an IAjaxResultVisitor interface: it’s shorter, but still fairly easy to understand in that context. If you push things far enough, most objects are usually nothing more than dictionaries of closures.

Back to the point: writing a proper object-oriented program with those tools is about the same as creating a statue out of stone using sculpting tools, and inheritance will not turn someone into an object expert any more than holding a chisel turns them into a mutant ninja turtle disambiguation page. How does it happen?

The original problem with procedural programming was that there was no easy way to unplug the damage-dealing module from the message-printing module. Dependency injection means that the damage-dealing module (which is now an object) does not know about the message-printing module (which is now also an object), instead, whoever needs to use damage-dealing code must provide it with a message-printing object that will be used to print whatever messages come up. That message-printing object might actually print those messages to the screen, or it might silently discard them, or it might keep them around and only display them to the screen if a specific event happens, or it might be an unit testing mock, or any other amount of different behaviors. In terms of dependency graphs, there is no static connection between damage-dealing and message-printing: the program decides at run-time what message-printing object should be bound to what damage-dealing object, and can create new objects on the fly should the situation require it.

Aside from that, there are no significant architectural benefits to using objects or classes — although touted as an object-oriented achievement, encapsulation is commonly available in functional and procedural programming as well, and using object.function() instead of function(object) is a matter of taste, not architecture.

As a final word, I am fairly doubtful of the ability of schools to teach their students about good object-oriented programming. They can certainly teach them how to use the tools, but the pains that object-oriented programming is meant to solve only become obvious when you have to work with a large, multi-developer project over a long duration and with changing requirements — anything less, and your pains will be subtle feelings of awkwardness instead. Definitely not something you can learn from. If you have never written an unmaintainable piece of mud, you cannot know how object-oriented programming can keep you from writing one.

You may now dish out punishments in the comment box below. Have fun.

Exception Avoidance

From what I gather, many significant users of Objective Caml are moving away from exceptions, opting to rely on return types instead. To the more traditional C#/Java/C++ worlds, this must certainly sound as a heresy — exceptional situations should be handled using exceptions in order to avoid mudding up the control flow. Right?

Wrong.

Exceptions are a powerful tool that lets the author of a function say «heck, no way I can do this, let’s tell the caller» while letting the caller of that function say «I don’t care about your problems, tell my caller» and so the exception, once thrown, bubbles up through the hierarchy of function calls until it reaches a point where it can be handled. Which, in the vast majority of cases, is in some sort of catch-all handler that basically amounts to «drat, it didn’t work, let’s do something else»

As the writer of the calling function, this makes your life easier: if any errors happen in the functions you call, they will nicely bubble up, so you only have to care about situations where everything works fine. As a human being, you are passing the responsibility for handling that failure to someone else on your team who will happen to write the code that calls your code. And a major problem with responsibilities being passed around is that they end up on the floor.

This is further compounded by the fact that the presence of exception handlers is not adequately verified by compilers, and you end up with code in a mess that can only be solved in two ways:

  • Read through all the code again, remember what functions might throw which exceptions, and add any missing handlers to treat them on the spot. This is hard.
  • Add a generic catch-all handler that silences errors but doesn’t really react to exceptions in an acceptable manner.

Maybe your team does handle all exceptions appropriately in the places where they are needed. That would be quite surprising (where do you work?) but let’s assume for a moment that this is the case.

What happens when a change of specifications mandates that a new kind exception needs to be thrown? Suppose, for instance, that a given database resource may be now locked by an user, making all attempts to access that resource (by other users) fail. How do you change your code to handle this new requirement?

Adding a new exception is perhaps one of the single most painful kind of code change in modern programming, because it is woefully ill-supported by both languages (by design) and tools. It would effectively require the «functional specs guy» to think of all the ways an user might trigger that lock condition and how the software should react, then provide the developers with a detailed plan of action that would let them jump into dozens of different code locations to add the appropriate try-catch block. If one such location goes unnoticed, it will be filed as a bug when a tester or customer noticed («I tried to save my document, but it said document locked and I lost my data!»).

Consider the following example:

let get_age (person : Person.t) =
  if Person.is_dead then raise PersonIsDead ;
  Date.subtract Date.today (Person.birth_date person)

I just added the exception-throwing line — how can I easily detect locations that are affected by this change? There is no way. So, what is the alternative?

Eliminating Exceptions

Objective Caml allows the use of polymorphic variants. You can create a tag on the fly and attach it to a value that you’re returning (the tag is called a polymorphic variant constructor). You can also return tags without attached values, which is helpful when signaling error conditions. Here is how I would rewrite the above:

let get_age person =
  if Person.is_dead then `person_is_dead
  else `age (Date.subtract Date.today (Person.birth_date person))

I used the tag `age to represent a valid age, and `person_is_dead to represent the fact that a dead person has no age. Looking at the compiler output confirms that my function was indeed identified as returning a tagged value, and all possible tags were described by the compiler in the function signature:

val get_age : Person.t -> [ `person_is_dead | `age of int ]

Since the return type of the function changed (after all, it used to be int), all the functions that call it now cause a compiler error, which makes them quite easy to find and correct. Most of them will usually correspond to what the «functional specifications guy» wrote in his document, but the compiler will also find those that were overlooked. For instance, there’s this line that displays the age of a person:

 printf "%s %s (age %d)"
  (Person.firstname)
  (Person.lastname)
  (get_age person)

Seeing this code, but not seeing anything about it in the specification document, the programmer asks the stakeholder in the next scrum meeting, and the latter decides that showing deceased would be enough. So, the code becomes:

printf "%s %s (%s)"
  (Person.firstname person)
  (Person.lastname person)
  (match get_age person with `age a -> sprintf "age %d" a | `person_is_dead -> "deceased")

Pattern matching is a powerful tool from the ML family of languages. It behaves like a switch, but with several added benefits:

  • It extracts any values attached to the tags, and the compiler checks the presence and types of those values. So, if I tried to access the a value from the `person_is_dead tag, the compiler would complain that no such value exists — the type signature of get_age specifies that no value is attached to `person_is_dead.
  • It verifies that all possible tags are handled by the construct. So, if I had to hide the age of all female users, I would return `is_woman and the compiler would point out that said tag is not handled in the above code — an interesting tool when adding more exceptional cases.

Preventing Exceptions

But that is not all. Having the compiler as an ally also helps the programmer prevent exceptions in the first place. For instance, suppose that, on the online profile page of every user, I have a tab that shows a countdown to the next birthday of that user. The code looks like this:

let show_user_profile person url = (* ... *)

let show_birthday_countdown person url =
  printf "%d minutes left to age %d"
    (get_minutes_to_birthday person)
    (get_age person)

let profile_page person url =
  (* This renders both tabs, then checks which one is 
   * active based on url and renders its content 
   *)
  Tabs.render url
    [ "profile",   show_user_profile person ;
      "countdown", show_birthday_countdown person ]

Obviously, a lot of functions called to render the contents of that tab will have to handle exceptions when the user is deceased — but the sane thing to do would be to hide the tab in the first place. If you don’t try to render the tab contents, no errors will happen while rendering it… The problem, of course, is that if we check whether the user is alive outside the tab contents (in order to decide if the tab is present) we still have to check whether the user is alive inside the tab contents, even though we know that the tab contents would not have been rendered if this were not the case:

let show_user_profile person url = (* ... *)

let show_birthday_countdown person url =
  printf "%d minutes left to age %d"
    (get_minutes_to_birthday person)
    (match get_age person with `age a -> a | `person_is_dead -> (* Should never happen! *))

let profile_page person url =
  (* This renders all available tabs, then checks which one is 
   * active based on url and renders its content (default is the first one) 
   *)
  Tabs.render url
    [ "profile", show_user_profile person ]
    @ (if Person.is_dead person then []
       else [ "countdown", show_birthday_countdown person ])

Above, an if-then-else makes sure that show_birthday_countdown will only ever be called if the person is not dead (otherwise the tab is not available and the profile tab is shown instead). However, the show_birthday_countdown function does not know that, which means that pattern matching must still be used even though the error is known not to happen — and I hate “should never happen” comments in code. Should another programmer reuse the birthday countdown in another context where the unwritten assumption that the person is alive is not verified, the compiler will not be able to detect it, and the “should never happen” will happen. Constructs like the above should be avoided like the plague. Instead, the unwritten assumption should be written into a type.

Let’s build a different type that is known to be a living person:

module AlivePerson : sig
  type t = private Person.t
  val is_alive : Person.t -> t option
  val get_age  : t -> int
end = struct

  type t = Person.t

  let is_alive person =
    if Person.is_dead person then None
    else Some person

  let get_age person =
    Date.subtract Date.today (Person.birth_date person)

end

How does this work? First, we define a private type — this means that within the definition of the AlivePerson module, AlivePerson.t and Person.t are the same type, but outside, they are different and incompatible types. Being incompatible means that we need to define a function that creates an AlivePerson.t from a Person.t — this is what AlivePerson.is_alive does : if the person is alive, it is returned with type AlivePerson.t, and otherwise no value is returned. This proves that only a Person.t who is not dead can be turned into an AlivePerson.t. Finally, we define the AlivePerson.get_age function which, on living persons, always returns the age without checking for them being alive.

Next up, we must re-define the original get-age function, because we don’t want to repeat ourselves: only write the age calculation code once! The good news is that the AlivePerson module provides everything we need to do this:

let get_age person =
  match AlivePerson.is_alive person with
    | None       -> `person_is_dead
    | Some alive -> `age (AlivePerson.get_age alive)

The function signature remains the same, so it’s a safe bet to say that the behavior did not change (and unit tests should make sure that it did not). Then, we can finally rewrite the offending tab-rendering code so that the countdown tab expects an alive person as an argument :

let show_birthday_countdown alive url =
  printf "%d minutes left to age %d"
    (get_minutes_to_birthday (alive :> Person.t))
    (AlivePerson.get_age alive)

let profile_page person url =
  Tabs.render url
    [ "profile", show_user_profile person ]
    @ (match AlivePerson.is_alive person with
       | None       -> []
       | Some alive -> [ "countdown", show_birthday_countdown alive ])

No compiler error occurs, and there is no more “should never happen” case in the first function. Also, should that function be re-used in a context where the assumption about the person being alive is not available, the compiler will issue an error (this value has type Profile.t but a value was expected of type AliveProfile.t) because type inference has determined (based on the call to AlivePerson.get_age) that alive is of type AlivePerson.t! Your unwritten assumption has been detected and checked by the compiler.

In Conclusion

In this article, I demonstrated how to turn exceptions into compiler-checked return values, this turning the compiler into a valuable ally that detects any locations where the addition of a new error might require altering existing functionality. I also illustrated how this enables the programmer to pre-empt the possibility of exceptions by injecting a compiler-verified proof that «this will not throw an exception» into a new module. The consequence of these coding practices is fewer mishandled exceptional situations, the ability to detect all consequences of a new kind of error, and a clean documentation of both error types and assumptions about what errors might or might not happen in a given piece of code.

Which, I believe, leads to better code and happier developers. What do you think?

Avoid Side Effects

Perhaps one of the single most useful aspects of functional programming is the absence of side-effects. In a pure functional language, there is no assignment — once created, you cannot change the members of an object or the value of a variable. All you can do is create new values based on old values. Advocates of these so called immutable values argue that there are many benefits related to what you can do with them: you can keep around both the new version and the old version (implementing an “undo” feature of sorts), often at a reduced memory cost due to sharing of sub-elements between the two versions.

While I agree with these, I believe there’s something far more interesting about immutable values — what you cannot do with them, and the consequences this has on feeling secure about your code.

In my current project, I use a coding style that absolutely forbids any form of side-effects with the exception of 1° a clearly delimited set of create/update functions that affect the database, 2° any side-effects that affect local variables and 3° during initialization. These are of course exceptions — while allowed, they are avoided in the vast majority of the code.

Reordering code

One simple consequence: the order in which calls are made is only constrained by things that the compiler can detect — the function that uses x is called after the function that returns x — instead of subtle «open must be called before send» errors, simply because there’s no way for «open» to affect «send» unless the former returns a value used by the latter. So, I can rearrange lines freely because the compiler will warn me when I make a mistake.

Protecting against invalid states

A corrolary of the above is that it becomes possible to design your types so that only valid states can be represented. For instance, consider a list that must contain at least one element — in a mutable context, there is no way to represent this within the type system because I could always write this:

for (i = 0 ; i < 10 ; i++)
  list.pop();

There is no way here for the pop function to indicate that it might cause a problem, because it returns nothing — the only way out is an exception, and this happens outside the type system in most languages. On the other hand, if we keep the original list unmodified and return the new list, this opens the opportunity for indicating whether additional pops are available:

type 'a one_or_more =
  | One of 'a
  | More of 'a two_or_more

and 'a two_or_more = 'a * 'a one_or_more

let pop : 'a two_or_more -> 'a one_or_more = function (_,rest) -> rest

By design, the pop function will never attempt to pop from a one-element list because it can only be called on a two_or_more value, and it returns a value which may or may not be a two_or_more — the code needs to check this, and the compiler will detect if you forget it:

let pop_n n list =
  if n = 0 then list else pop_n (n - 1) (pop list)

Here, the type of the list variable is inferred to be two_or_more (because it’s passed as an argument to pop) which means that pop list (of type one_or_more) is not a valid argument to pop_n. An appropriate solution (popping up to n, if available) takes this into account:

let pop_n n = function
  | (One _) as keep -> keep
  | more -> if n = 0 then more else pop_n (n - 1) (pop more)

Let’s summarize what happened here: I used a different representation for one-element lists and more-than-one-element lists, which let me selectively define a pop operation only for more-than-one-element lists, and this allowed the compiler to detect when the «must contain at least one element» constraint might be violated. Have fun achieving this in your non-functional language of choice! Using a different representation in this way was made possible by the fact that pop returns a new list instead of altering its argument — so the new list can be of a different type than the old one.

This is a fundamental principle in compiled functional languages: if a given state is not valid, design your types so that it may not be represented. When this is done correctly, not only does it become harder to write code that tries to create an invalid state, but when that code is written, the compiler rejects it.

Memoization

If you are certain that a given function is pure but involves computations that you would rather not repeat too often (such as database reads), you can resort to memoization: wrap the function in a piece of code that memorizes the result of every call. In my current project, this is as simple as:

let get_pic_url = Util.memoize Picture.get_url

From then on, asking ten times for the URL of a given picture will result in a single database request. Had the function involved a side-effect, memoization would have changed the program behavior: knowledge that the function is pure lets me apply memoization without any risk of creating a bug.

Retry until successful

Yet another situation is when dealing with CouchDB transactions, which involve the following process:

  1. Fetch current version of the document from the database
  2. Modify the document
  3. Write the document back to the database
  4. If a collision happened, repeat from 1.

In short, the application must repeat steps 1-2-3 until a write succeeds. Correctness of these algorithms is difficult to test because development servers seldom have enough active users to actually cause conflicts (and when conflicts do happen, they are hard to trace). By deciding that steps 1-2-3 are handled by pure functional code, I have the guarantee that the loop can be executed any number of times and yield the expected result. Right now, a transaction looks like this:

let accept id = DB.transaction id
  begin function
  | None -> (* No object in database *) DB.Keep
  | Some post ->
    if post # accepted then
      (* Post is already accepted *) DB.Keep
    else
      (* Accept post *)
      DB.Put (object
        method accepted = true
        method author   = post # author
        method text     = post # text
      end)
  end

By applying a “never use side-effects” convention, I know without even looking at the function contents that there are no side-effects inside.

These four examples are original in that they don’t try to showcase the power of immutable values and how they can be used. Instead, they simply illustrate how a program without mutable values has interesting properties that make refactoring easier, help the compiler detect many more errors, and allows the application of patterns that can only be safely applied in the absence of side-effects. Like the monks of old, I write my software without the mundane facilities of mutable values, but I derive significant benefits out of it.

Objective Caml Web Programming

The core RunOrg¹ application clocks in at about 30K lines of Objective Caml code, with around 2K being added every week. If you factor in our use of CouchDB, all of this might strike you as an odd choice of technologies, based on esoteric hopeful fantasies instead of cold pragmatical consideration. It isn’t, despite what others might say:

OCaml: You know yourself to be fast, smart, and extremely reliable. However, you look kind of funny and nobody really wants to talk to you. You spend most of your time sitting in a public library glaring at people, occasionally yelling “NOBODY HERE APPRECIATES MY GENIUS!” and getting kicked out.

Two years ago, I discussed the topic of using Objective Caml for web programming:

What would happen if a compact web framework were proposed? One that, in addition to borrowing existing useful concepts from other languages, also added some OCaml-specific features to the mix. Functional modules would be an interesting addition, so would be the type system and pure functional programming applied to transactions, and monadic optimization at initialization time would also be quite interesting.

Eliom

Let’s get this out of the way first. I have been continuously peeking at Ocsigen – Eliom (a web server and assorted web framework) ever since it was mentioned in a comment, and some aspects of it resonated with me while others really did not. In many ways, it served as a showcase of the many ways in which the peculiarities of Objective Caml can impact the development of a web project, and helped me decide whether these were appropriate or not. This evolved into my own rendition of a web framework, Ozone, connected to an apache server through OcamlNet2-powered FastCGI.

There were many reasons for avoiding Ocsigen – Eliom, though I do not believe any of them to be universally true. The main reason was described in Guillaume Yziquel’s comment on that article:

Somehow, even a Ruby on Rails app is a state machine. Perhaps a “better state machine”, but a state machine nonetheless, in the sense that incoming requests interact with each other by modifying the internal data.

With Ocsigen / Eliom, it’s completely different: it’s a “safely” multithreaded, compiled, application. And that makes all the difference.

Based on my experience with Ocsigen – Eliom, I fully agree with this assertion, but consider it a liability in my situation. Our business plans call for a number of users that cannot be safely expected to all run out of a single server, be it multi-threaded, for both scaling and redundancy reasons. At some point, the only communication bridge between two requests will be the database back-end, and I need my web framework to accept that and actually make sure that my one-server code will gracefully scale up to a multi-server setup.

On a more philosophical level, I agree that «On [the] server side, somehow, the “state machine” paradigm has been a hindrance», but HTTP being what it is this is a basic truth that will not go away. Eliom is building an abstraction on top of it that will continuously spring leaks whenever the disconnected nature of HTTP surfaces. This is what ASP.NET and countless other technologies tried to do and they have all made the fall back to HTTP harder when the situation did eventually ask for it.

Ozone is also a compiled application, but it has one thread and no sessions — scaling happens by launching more instances of the application and therefore supports transparently the addition or removal of servers, while “session data” is stored in a combination of client-side state, database storage and HMAC proof tokens in the URLs. While this ascetic approach cuts me off from the sheer sexy of what Eliom allows, the tradeoff is a fairly convenient set of scalability guarantees. But if you can afford all that Eliom sexy, then I have no issue with that.

Benefits of OCaml

This is why I use Objective Caml, in no particular order.

  • It’s fast out of the box — OCaml is on par with C performance as long as you don’t stray too far into sub-optimal areas (such as naive string concatenation). I can write any kind of code and be assured that it will not be the bottleneck, because database access and HTTP are a lot slower: right now, the average HTTP request takes about 80ms, with about 60ms for the actual HTTP transfer, 18ms for database latency, and 2ms for all of the Apache-FastCGI-Ozone sequence when compiled without optimizations.
  • It’s a compiled application. This one is mostly aimed at my PHP friends, where every request starts a new PHP execution from scratch — this makes it several designs impossible or impractical, such as event-based programming: this would require B to register as a listener to A’s event, which means B should be identified as a potential listener and loaded for every request even if it does not trigger the event. Once initialized, a given Ozone instance can respond to tens of thousands of requests, which makes it worthwhile to run a lot of pre-processing and pre-caching operations during initialization.
  • It’s safe. I use a programming style that relies on avoiding exceptions, never using wildcards, defining many new types for almost everything, and writing pure functional code. This eliminates entire realms of bugs : using the wrong variable, forgetting to call a function or catch an exception, being surprised by a sneaky side-effect or doing things in the wrong order… About half the bugs I caught using Unit Tests don’t exist in OCaml (null reference exceptions, anyone?) and the other half is eliminated by my programming style — so I don’t write unit tests anymore (well, I do write an automated “test” every time I find a bug, but it’s usually as simple as adding a type annotation). This also lets me routinely refactor literally half the application every other week, without causing any bugs.
  • It’s concise. Most of the features I write are a matter of a mere hundred lines — most of the code is related to my obsessive need for being explicit. Being a functional language, you can define a brand new anonymous function on the spot and throw it into another function that is returned by yet another function which is then given to yet yet another function, all of it being implicitly type-checked without having to define a single IAcceptsBoxObserver interface or LeafBoxObserver implementation.
  • It has a fast compiler. Building those 30KLOC from scratch takes less than a minute — the average incremental build takes one or two seconds. Whenever I have any doubts about what I’m writing, I can just ask the compiler — Hey, did I forget anything about this function call? Why yes, master, you forgot to check that the user was indeed allowed to reply to that message.

The most essential feature is complete compile-time safety. As a web programmer, I have to be careful about hundreds of small details — can this text be translated into another language? Is this user allowed to do what they just did? Did that object disappear from the database while you were editing it? Does that URL really correspond to an actual page? Did you remember to check for script injection in that piece of HTML? Is this GET parameter available at this point in the code? Is this object available or locked by another user? Did I forget anything else? It’s impossible for a human brain to think about all these things while at the same time creating an elegant design or refactoring a piece of code or writing a new feature. I can use the flexible OCaml type system to check for all these details through appropriate design of the Ozone API, which turns the development process into a game of 1° write the simplest code that works, 2° listen to the compiler’s suggestions for making it fail-proof. It’s a game that I’m becoming fairly fond of, and it lets me concentrate on the very core of what I’m trying to do.

Disadvantages of OCaml

It’s not a happy fun place. Quite the contrary: the language comes with a set of annoying quirks and flaws that do make things harder. Before you jump in, you should know what to expect.

  • Type-safety has a price. If the type system cannot express a certain thing, then you can’t do it. There are a few fairly complex examples where this has caused me trouble, in areas such as optional function arguments, module meta-programming, JSON serialization or dynamic database-driven data structures. Workarounds exist, but they’re only workarounds. Another side-effect is that type inference can make it hard for inexperienced developers to find an error, especially if you do a lot of strange type wizardry. Not to mention the silly yet annoying “this expression has type foo but is used here with type foo” error.
  • Lack of tools and libraries. Being a non-mainstream language means there are no heavily tweaked and highly evolved tools available (think about the wealth of tools available for C# or Java development), which gives a certain clunky feel to development. Besides, many libraries which are taken for granted in the mainstream world are missing or non-documented — try connecting to the Facebook API and you’ll notice that not only there is no Facebook SDK in OCaml, but there is also no documented way of using HTTPS. The same goes for Amazon S3 and MD5-based HMACs, by the way. And iconv functionality. And removing the X-Mailer header from e-mail you send. The list goes on.
  • It’s not object-oriented. You can use classes and mutable objects — it’s a viable implementation strategy, but it also bears a lot of the typical issues encountered in the mainstream programming languages, and it lacks the conciseness of functional approaches (defining a class and instantiating an object is bound to be longer than a lambda). If you’re not in the right mindset for using the language, you will miss on a lot of the benefits.
  • It’s not popular. It is a disadvantage, just not a technical one. As a programmer I couldn’t care less about the popularity of my language because, you know, COBOL was very popular once. As a hiring manager, I am aware that using a non-popular language will make hiring developers harder. As a start-up founder, I know that this reduces my chances of selling my company because esoteric technologies are a risk to potential buyers.

There are also many tiny quirks in the language that I hope would eventually be solved. For instance, there’s the absence of a shorthand notation for the ubiquitous (fun x -> x # member). There’s also the lack of C#-like properties, with a pure functional twist:

val x = init

method get_x    = x
method set_x x' = {< x = x' >}

And, of course, there is a lot of things going on with the option type that BatOption just isn’t up to expressing concisely. The P4 preprocessor could be applied to these situations fairly reasonably, but I would feel more comfortable if they were built into the language (and syntax highlighting tools).

In conclusion, OCaml + CouchDB provide our team with the flexibility required to build new features frequently without being afraid of subtle bugs or regressions, and to regularly refactor our code into a more amenable mess. It is a level of compiler-provided safety, surgical refactoring and bug detection that would be simply unavailable with C# and Java (and hopeless with PHP, Python or Ruby).

¹ RunOrg is my Start-Up ; we provide an online tool that helps associations, unions, organizations and communities manage their members, contacts, activities, events, knowledge and online presence.

Idle Musings on CouchDB Architectures

One of my recent application design decisions was to go the full NoSQL route, using CouchDB as the database. For those of you who never heard of it, CouchDB has the following benefits that I rely on:

  • Flexible format. It stores any JSON-encoded object instead of lines with pre-defined data columns. In fact, there is no built-in schema, so the application code is free to store any kind of data format in there without having to run ALTER TABLE statements. This was the biggest selling point for me, because it drastically reduces the development time of any database-related code.
  • Lock-free Master-Master replication. Out of the box : you can have two CouchDB instances running on two servers, and send write requests to both, and they will eventually be in sync. This is great for both spreading the write load and for handling server failure gracefully. The catch: your application is partially responsible for not screwing things up.
  • Complex Queries. As far as queries go, CouchDB only caters to high-performance queries. Just like in a traditional RDBMS, query performance is improved by defining an index. The difference is that if no index is defined, you cannot run the query on CouchDB (whereas with SQL, the query would run by doing full table scans or in-memory sorting or similar low-performance fallback solutions). This means that if your queries run, they’ll run fast, but it also requires you to put more thought into any requests you might need to make. On the other hand, CouchDB indexes (they’re called views) are much more powerful than standard SQL indexes. If you wanted to, say, sort a list of users by the number of non-ASCII characters in their name, you could certainly do so.

The general consensus over NoSQL is that it should mean «Not Only SQL» : use an SQL database for the added query flexibility and transaction management, and use a non-SQL solution where performance requirements create a need for it. The typical solution would be to have the SQL database act as a Master and regularly update the data in the otherwise read-only non-SQL database. And there are pretty good points to be made about that, such as Enterprise products being expected to respond to a vast range of queries.

Still, RunOrg¹ is not (yet) intended as an Enterprise product, and all the data display goes through the web screens we design — so there’s no serious pain in writing the CouchDB views at the same time.

And while CouchDB does not support transactions, there is a clean way in which transaction-like effects can be achieved (with the added benefit of being lock-free): every document in a CouchDB database stores a revision hash, and updating or deleting an element requires you to provide the current revision hash as part of the query. So, the typical update process looks like this:

  1. Read current document.
  2. Construct new document based on existing data.
  3. Send the update query, using the revision hash from step 1.
  4. If step 3 failed (because the document was changed while you were working on it), go back to step 1.

If you’ve ever used Subversion et al, you can recognize steps 1-2-3 as being Update, Merge, Commit. It’s that simple. The catch: it’s limited to only one document at a time, so you cannot atomically update two documents.

Of course, if you’ve ever used CouchDB, you probably know all of this, so let’s get to the point already. When you design your CouchDB databases, there are several important things you need to keep in mind. In no particular order, they are:

You Need Asynchronous Processing

Any CouchDB setup needs some tending-to on a regular basis. Databases have to be compacted (I do so hourly, but I expect this to evolve as the number of users increases), changes between databases will need to be propagated and conflicts will have to be detected and handled.

In addition to that, you will probably have application-related needs, such as sending e-mail (you don’t want to lock the HTTP request until the SMTP conversation is over, especially if you’re sending more than one message at a time), processing image files or office documents (you probably want to do that processing on a different server with the appropriate software anyway) or long-running requests.

In short, you need one or more asynchronous processing bots and interacting with the database. To make things easier and reuse the data access code, I just write my software to be able to run in both async-bot mode and in HTTP-server mode. This ended in designing “save task to database for later execution” construct in my supporting library.

Determine the Master Data and Propagate

While the CouchDB view system is fairly flexible, it’s not universal. The textbook example is blog article tags. The tags on a given article would be stored as an array in the article document itself:

{
  "title" : "Idle Musings on CouchDB Architectures",
  "body"  : "...",
  "tags"  : [ "Architecture" , "CouchDB" ]
}

This simple format lets you 1. see an article’s tags, 2. find all articles with a given tag and 3. count the number of articles for each tag. It does not, however, let you find the ten most used tags — you could certainly query the entire “number of articles for each tag” view and then sort the data in memory, but if an application contains tens of thousands of one-document tags, you’re basically querying 1000 lines for every line you display.

The suggested solution is to create another database (usually in the same database server) to store documents representing the tags in a more adapted format, such as:

{
  "_id" : "Architecture",
  "num" : 18
}

This format lets you easily sort the document on the num field. Problem Solved!

How you actually copy the data from one database to the other is up to you. Just keep in mind:

  • Always clearly determine which documents or fields are original data and which are cache data. You don’t want to mistakenly update the master from the slave, or update the slave instead of the master. Try to keep the distinction at database level (this database contains slave data), and only resort to slave fields when it’s absolutely necessary.
  • You need to have a periodic refresh process that rebuilds the cache from scratch, just in case it ended up out of sync. Depending on your data, a day-long process, an hourly process or a midnight process might be more adapted.
  • How fresh must your data be? Perfect freshness means you need to update the cache as part of the normal document-saving process — great for having up-to-date data, but slows things down. Minute-level freshness lets you delegate the cache update to an async process that detects the change and refreshes the cache. With hour-level freshness, you can rely solely on the complete cache rebuild.

Either way, have a consistent picture in mind of how you want to achieve this before you need it — trust me, copying data to another database is a lot easier than hand-crafting the perfect data structure to handle every query you need.

Don’t Update, Insert-and-Merge instead

Updates are slower than plain inserts because you need to read the original data first, and they create the risk of conflicts with Master-Master replication. The textbook example is with Alice and Bob updating a given document in their respective databases: Alice sets the title to Foo, and Bob adds a few words to the article’s body. Then, replication happens, a conflict appears and you need to determine what the title and body should be — in this case, the sane thing to do would be to keep Alice’s title and Bob’s edits to the body, but you don’t have enough information when resolving the conflict to actually know that.

Now, consider a different strategy: Alice inserts a “set title to Foo” line in her database while Bob inserts a “change document body” line in his database. You can then retrieve the current version of a document by reading all the lines related to that document and merging them together according to whatever rules you see fit (and, for bonus performance points, save the result to another cache database as described above). When the replication happens, both lines will appear in both databases, the merge code will run again, and both changes will appear in the resulting document.

And you get a free revision history with the ability to selectively cancel changes down the line. Or, you can decide to compact older inserts (when it becomes obvious that there’s no risk of collision anymore) to save memory and improve merge performance.

Please note that this applies only to master data — slave data conflicts can be trivially solved by refreshing the cache from the master, so update slave data to your heart’s content.

Avoid Unique Constraints

This is possibly the single most annoying issue with CouchDB, but it’s pretty much part of the distributed Master-Master package. The basic idea is this: you need a given field or value provided by the user to be unique. For instance, you don’t want two accounts to have the same username. So, when users pick their user name, you need to atomically check if it’s available and reserve it. This is impossible in a Master-Master scheme without expensive locks or elaborate cross-transactional strategies.

The first thing you should try is to eliminate that constraint or turn it into something a little bit more amenable. For instance, if two users reserve the same e-mail address as their username, you could detect that once it happened and merge the accounts — they probably belong to the same user. When it’s applicable, the detect-collision-and-merge solution is the one that’s easier for performance.

The last resort solution, which isn’t actually that bad, is to give up on Master-Master replication for that specific feature. You can have a dedicated database to store username-id relationships:

{
  "_id"    : "victor.nicollet",
  "account": "3958377a5093b22673a26b6c33002e02"
}

The actual account would be stored in a different database which does use Master-Master replication:

{
  "_id"     : "3958377a5093b22673a26b6c33002e02",
  "username": "victor.nicollet",
  "fname"   : "Victor",
  "lname"   : "Nicollet",
  "passhash": "..."
}

All creation requests would first try to create the username document in the first database — if it’s already taken, an insert conflict happens immediately and you can react by asking the user for another username. If no conflict happens, you can then insert the document in the second database with no fear of conflicts. The same process happens when trying to change the username of an existing account.

This solution works as long as the number of creation requests remains small, and you can afford the round-trip to the central database — which might be on a different continent (or worse, you might be working offline).

On The Splitting Of Databases

With CouchDB, there’s the inherent restriction that a given database must be entirely contained within a single server. Replication lets you split the request load, but the storage load remains the same as both duplicates must store the exact same data eventually. In addition to that, the underlying storage strategy is insert-only, meaning that if you update a given document a thousand times, you eat up a thousand times the memory footprint of that document — and in order to run a compaction, the server needs to have enough disk space to store both the uncompacted database and the newly compacted one, so always make sure you’re not running close to full disk usage.

Keeping the database small is done in a number of ways. For instance, keeping ID values small (I use 11-character base-62 UUIDs) or having short fields name in the JSON documents. One of the most potent techniques is simply splitting the contents across several databases.

One splitting strategy is to act based on the ID or another reasonably distributed property of the document. If even ID values are on server 0 and odd ID values are on server 1, you do not hinder your query-by-ID possibilities at all (you can determine if the ID is even or odd quite easily in the application, and pick the appropriate server). On the other hand, views don’t work anymore, since neither server can aggregate the entire range of existing values. You can get views to work by striving to keep together subsets of values that are queried together — RunOrg¹ stores items together on a per-association basis, so all items of a given association are available on the same server and can be map-reduced together — but this makes query-by-ID harder to achieve. Tradeoffs, as usual.

Another splitting strategy is to act based on the type of the document. Put user accounts in database A, posted articles in database B, article incremental updates in database C and user comments in database D. As long as there’s no database-level interaction between the different documents, this will work, and the code loading a document knows what kind of document it’s loading and where it should load it from. The catch: a view from database B cannot return documents from database A (so, no querying for “the article and its comments” or “the updates and their authors”). Be aware of that limitation ahead of time.

Keep A Schema

Not requiring a schema is no excuse for not having one available. It always helps to remember what is being stored where and how (and why!), which documents are updated from which others, and so on. While standard relational representations can work, don’t be afraid to include document-store-related representation features, such as the way for an item to reference a list of other items, and views.

Final Words

As it stands, CouchDB is certainly expressive enough for our needs, once a few elementary features (propagation and asynchronous tasks) are available. The high flexibility of the data model, combined with the fact that an application can easily be made to read several document formats and convert to the latest one when possible, greatly improves the speed at which new modules are developed — right now, model code accounts for 5% of work time, with refactoring at 15% and user interface (HTML-CSS-JS) towers at a heartbreaking 80%.

What are your takes on CouchDB? Do you use it? Did you try it? Any experiences you might have that would be worth sharing?

¹ RunOrg is my Start-Up ; we provide an online tool that helps associations, unions, organizations and communities manage their members, contacts, activities, events, knowledge and online presence.

Don’t Push – A Small Review of Cache Strategies

The standard behavior of most cache system follows these steps:

  • Attempt to read needed data from the cache
  • If data is missing, compute it and place it in the cache
  • Return the data

This is a fairly streamlined process that’s easy to add to almost any single algorithm that constructs data. The cache could be local (it’s part of the application, or even of the current function call), it could be dedicated (memcached), or the data might be persisted back to the database (such as adding the number of files in a given folder to the folder object itself instead of counting them every time).

The root of this strategy is the principle of memoization: if a function is pure — that is, calling it with the same arguments twice will return the sale result twice — then you can place such a cache in front of that function so that it will only be called once for every argument.

Memoization obviously found its way into RunOrg, because it’s literally a one-word optimization hint that trades memory for performance where it matters. In practice, in a web application like RunOrg, the only really costly computation is sending requests to the database, which is by definition not pure. Still, I can usually expect that for the duration of a single HTTP request, the database contents will remain reasonably stable, so I can create a temporary memoization cache when the request starts and drop it when the response is sent. Actually, I’m using a slight variation on standard memoization which is batch memoization: in order to improve performance, queries for objects A, B, C and D are represented as a single batch query to the database asking for the list of four objects. With standard memoization, if I asked for objects A, B and C, then for objects B, C and D, then six objects in total would be requested (because those two lists are different). By extending the memoization algorithm to know that list elements are independent, I can have the second query ask for only D, and retrieve values for B and C based on the first request.

Outside of this situation, however, functions can hardly be considered pure, so special steps must be taken to keep the cache up to date. This results in three common strategies.

The first one, which is the cache expiration strategy, is to give up on data freshness and decide that data in the cache will survive for a fixed duration regardless of whether the actual data has changed or not — so, instead of declaring that data is always up to date, it declares that any change will be visible everywhere in less than X seconds. While somewhat weak, this strategy is particularly effective because it does not need any kind of knowledge of the relationship between the cache and the underlying persistent data — the only connection between the two is the compute-if-missing steps outlined above.

Once you decide to handle the relationship seriously, two more strategies become available. The cache invalidation strategy is activated when the underlying data is changed, and invalidates all cache items that are dependent on that data. Thus, subsequent requests for those items will trigger a cache re-computation and always serve fresh data. Of course, this means that the cache system can easily tell which items should be invalidated. This is fairly easy in a one-to-one data-to-cache mapping, but as pieces of data can be mixed and matches into various cache items, this requires an unusually complex architecture to handle.

A nice design trick to keep in mind is that you don’t always need to find all that data — sometimes, you can simply «lose» it: for instance, web caching uses a combination of expiration and invalidation strategies. When a file  is sent to a browser through HTTP, it sometimes carries a header explaining when it expires. This is useful when all the pages on your web site use the same CSS or JavaScript files, because then your visitors will only need to download them once and will use the cached versions from the on. To handle changes in CSS and JavaScript files, some web sites rely on cache expiration (a few minutes or hours of cache lifetime, so that any changes are detected soon enough) while others use cache invalidation. Obviously, the web site can’t go and notify every single browser that the CSS files have changed (especially since some browsers are closed or offline) so it will simply lose the data: while the CSS file named style.css?0001 will remain in cache for up to a month, the pages on the site are now asking for style.css?0002.

The third, the cache refresh strategy, is a variant on the former: instead of merely invalidating the cached data, this strategy computes the new data and places it directly into the cache. This is necessary when the data is frequently accessed and the computation is long: if one hundred users come asking for the data while it’s invalidated, then all of them will compute the data as part of the “if missing, compute” step of the caching process, which will probably bring the server to its knees — what people call a stampede — so the only safe thing to do here is to keep data in the cache at all times and replace it with a more recent value whenever necessary.

Another nice design trick is to use a flexible expiration date to turn a cache invalidation strategy into a cache refresh strategy: instead of invalidating the item by removing it from the cache, merely set its expiration date somewhere in the past. Then, to avoid stampedes, whenever an user detects that the cache is expired, they set the expiration date to sometime in the future and start computing the data. So, the first reader will notice a delay (as his request will reconstruct the cache), subsequent readers will instantly read the old cached version without triggering a stampede, until the first request ends and the cache contains the new version of the item. To choose between the two: if the event that triggers the refresh provides you with data that improves the time required to update the cache, then update it, otherwise merely invalidate it and rely on your stampede protection to do the job.

There’s one last situation that makes caching complicated, which I’ve recently had to handle with the RunOrg application — indexing. Suppose you have a huge amount of data that you need to wade through, nicely split into separate objects that each have their own responsibility (my profile, my membership information, my participation to event X, my answers to poll Y…) but you sometimes need to virtually aggregate all that data and traverse it to retrieve only certain parts of it: give me the name, premium-or-not-premium membership status and answer to “T-Shirt Size” in poll Y, sorted by the registration date to event X. Yes, that’s one of the many queries that RunOrg lets you do (and you can even print out that data to serve as a list of participants). Now, trust me, there’s no sane way to dynamically run queries of the sort on a clean normalized database and still get reasonable performance. So, you need to create an alternate, denormalized representation of all that data and keep it in cache to avoid re-computing it for every request.

The problem with such a cache is that you cannot afford to re-construct a thousand lines of cached data because one user changed their T-Shirt Size answer, so there can be no high-level validity check. Basically, if you can’t trust the cache to be up to date when you run your read query, you lose. The traditional “try to read, if invalid update” approach to caching goes straight out the window. You need a solid cache refresh implementation that pushes the most recent data into the cache as soon as it becomes available.

RunOrg uses a variant — the cache pull strategy. This is merely a small semantic shift, but it’s quite helpful: in the original cache refresh situation, the data model needs to be aware of the cache, because it must actively send data to the cache whenever a change happens. With the RunOrg variant, the data model merely publishes a “data was updated” event that the cache module may listen to and react by refreshing its contents. So, the knowledge of how to extract data from the model and place it in the cache now belongs to the cache module instead of being spread over both the data model and the cache module. This not only makes the code cleaner — the data model becomes cache-independent and thus easier to read through, with cache modules being tacked onto it using the event system — but it also lets the cache react to events from different data models: an item might be updated when the profile changes, or the membership information changes, or the event X participation changes, or the answer to poll Y changes… and will have to read data from all four to compute the new value anyway.

Obviously, the cache pull strategy is a more complex architecture than the previous one:

  • You need an event system — the entire contraption hinges on the fact that a cache module can listen to the changes that happen in a data module.
  • Your cache module must track the dependencies of each item, in order to update that item when it receives a change event for one of its dependencies
  • You need asynchronous processing, as pulling values for dozens of items simply cannot be done as part of the standard HTTP response cycle
  • You need to follow clean multi-process patterns to handle simultaneous updates of some items

Still, given the performance we achieve with this approach, and the clean code that results from the underlying events-and-async structure, the results are certainly worth the efforts.

RunOrg is my Start-Up ; we provide an online tool that helps associations, unions, organizations and communities manage their members, contacts, activities, events, knowledge and online presence.



1150 feed subscribers
(readers who polled a feed this week)