Monthly Archive for July, 2011

Short OCaml Diffs

I’m working on some wiki functionality right now, and one of the basic properties of a wiki is that history is kept for all pages. And this should be done elegantly, without actually keeping a copy of every version, especially since many changes are actually just small fixes. This is usually done by keeping a diff — a small piece of data that stores the changes needed to get from version A to version B. Computing an arbitrary diff is a fairly easy task, but computing a diff that is as small as possible is actually challenging, because there are many ways in which a diff can be created and only a few of these are optimal.

The standard approach to computing diffs is to find the longest common subsequence — a non-contiguous sequence of characters that is found in both versions, such that one may turn version A into the sequence only by removing characters, and then turn that sequence into version B only by adding characters. The list of additions and removals is what is commonly referred to as a diff, because this is how the UNIX program diff works.

I am not happy with this solution. For one, it works based on lines, which is fairly acceptable for source code that is cleanly split into many lines, but the average wiki page is not as clean. Instead, every paragraph tends to stay on its own line and is only word-wrapped for convenience. Using a line-based diff tool on such data would cause the removal and addition of entire paragraphs, which is certainly not optimal if the change was a minor typo fix on a single word. And since the diff algorithm itself is quadratic, it cannot be coaxed to work on characters instead of lines : a one-millisecond line-based run would take ten seconds on characters !

Another shortfall of the longest common subsequence approach is that it fails to detect sentence or paragraph swaps, yet these happen quite frequently in the rewriting stages of a wiki page. Indeed, regardless of how clever the diff algorithm is, the diff format itself provides no primitives for moving data around, only “add this” and “remove that” are supported.

I have designed and implemented a custom diff algorithm. Instead of working based on “add” and “remove” primitives, it carries a data segment where all the new content is stored, and works with “blit from old” and “blit from data segment” instructions to construct the new version from the old. Its format is optimized to be stored as JSON, and the diff application algorithm is simple enough to be handled by a short piece of JavaScript if necessary. It’s on GitHub, by the way.

For instance, consider the sentences “The quick brown fox” and “Brown foxes are quick“, and how the former can be transformed into the latter.

A longest common subsequence approach (on characters, assuming this is sane) would determine that the LCS is the seven characters “e quick“, and so it would be written as “ThBrown foxes are quick brown fox” which is indeed the shortest diff you can get using the LCS approach.

My algorithm instead determines that contiguous substrings “rown fox” and “e quick” are present in both sentences, so the diff would be written as “Brown foxes are quick“. The corresponding JSON is quite short, too:

["Bes ar",1,[10,8],5,[-12,7]]

The first element is the data segment (the five new characters), single integers are blit-from-data-segment (length) instructions, and integer pairs are blit-from-old (offset,length) instructions.

The theoretical algorithm complexity is O(m+n), the actual implementation on GitHub is worst-case O(mn) because I used lists instead of hash tables at one point, but should actually be O(m+n) on any input that happens to be written in a human language because working on a human language almost guarantees that those lists will contain zero or one elements in most situations.

The algorithm itself relies on statistical properties of character pairs to quickly match up sequences from both strings, such as by finding out that the ox pair only appears once in each sentence, or that the Br pair only appears in the new sentence. On medium-sized texts (up to 50,000 characters), there are several character pairs that are unique or missing, which helps the algorithm guess correctly where each piece came from. Once the source of the pieces is identified, generating the diff is extremely simple.

The Plans & Pricing Page

While creating my own Plans & Pricing page, I collected screenshots of those same pages from a variety of companies. This helps illustrate both what everyone is doing, and how some of them are innovating on the matter.

backupify.com – backups for cloud data

basecamphq.com – online project management

bitbucket.org – online code hosting

ginzametrics.com – SEO analytics

github.com – online code hosting

huddle.com – online project management

raventools.com – SEO analytics

rhapsody.com – cloud music service

seomoz.org – SEO analytics

Detailed Usage Statistics

Use experience is heavily influenced by how fast the application or web site reacts. It makes sense to set up tools that help the development team detect performance issues and correct them. I’ve shamelessly pilfered the basic ideas illustrated by Jeff Atwood last month :

That’s why, as a developer, you need to put performance right in front of your face on every single page, all the time. That’s exactly what we did with our MVC Mini Profiler, which we are contributing back to the world as open source. The simple act of putting a render time in the upper right hand corner of every page we serve forced us to fix all our performance regressions and omissions.

When logged in as a member of the team, every request sent to the server, including all AJAX requests, display a small summary of what happened during that request:

In addition to these real-time stats, the server also saves this profiling data to the database (at a cost of an additional millisecond or two). That database is then sliced up into atomic operations like “Get Picture”or “Get Item”, averaged, and consolidated into charts:

The operations are sorted by Impact = Duration × Frequency, which happens to be an accurate approximation of how much performance could be gained by optimizing a given operation. The database load is not a duration, it’s a summary of how many requests were performed and how many bytes were sent or received, but it does not reflect how long it took the database to actually process the request.

This chart tells me that, among all atomic operations, accessing a picture is the most important one — which is quite expected, because every page contains several user pictures: even though the operation is very fast, it’s executed extremely often. So, it would be possible to slash request times across the board by either de-normalizing the picture URL into the user profile (so that both the name and picture URL are loaded from the database in one read) or by storing the picture-identifier-to-URL mapping in a distributed cache such as memcache.

The second most important operation would be user identification: determining the relationship between an user and the workspace they are trying to access, which happens on every single HTTP request. Whereas grabbing a picture is a simple fetch-by-identifier from CouchDB, user identification involves accessing a view with include_docs=true, which is slower. The optimization I am considering is to provide the user with a cookie that contains the identifier of their relationship to a given workspace, which is easy because every workspace has its own sub-domain, and run a fetch-by-identifier from CouchDB instead of a view access (of course, if the cookie is incorrect or missing, the server will fall back on a view access). From my experience with similar situations, this would cut the execution time in half.

NoSQL Is A Premature Optimization

Or so Bob Warfield writes. I happen to agree with the title — optimization using NoSQL means using a server cluster to split the load and scale up, and such an optimization is premature unless you are already having the millions of visits it takes to feel growing pains. If I start off on a new project and decide «I’m going to use NoSQL so that it will scale when my project will have millions of users» then I am prematurely assuming that your initial NoSQL strategy will fit the actual million-user scenario that will come up years from now. In fact, the bottleneck will probably be in a feature I didn’t even think of yet, and making it work will probably involve changes in the persistence model. But Bob Warfield goes further than the premature optimization argument:

Point 2:  There is no particular advantage to NoSQL until you reach scales that require it.  In fact it is the opposite, given Point 1.

It’s harder to use.  You wind up having to do more in your application layer to make up for what Relational does that NoSQL can’t that you may rely on.  Take consistency, for example.  As Anand says in his video, “Non-relational systems are not consistent.  Some, like Cassandra, will heal the data.  Some will not.  If yours doesn’t, you will spend a lot of time writing consistency checkers to deal with it.”  This is just one of many issues involved with being productive with NoSQL.

My current SaaS project pivoted from MySQL to CouchDB nearly at the beginning, certainly before we had any customers or any features worth showing. My greatest fear when settling on CouchDB was that I would have to work around the NoSQL lack of transactions, joins, consistency or whatever else you expect from a database system.

I was sorely mistaken, and so is Bob Warfield.

Even though NoSQL fails to solve many of the low-level problems that SQL eats for breakfast, this does not make it incapable of solving the same high-level problems as traditional relational strategies, you just need to understand how to do it, in the same way that you had to understand relational algebra, joins, indexes and transactions before doing anything worthwhile with SQL. Coming to NoSQL and expecting to solve your problems with those same strategies that you used in the relational world is as silly as using a hammer to drive screws in.

For instance, CouchDB has no global consistency, only eventual consistency — there is an inconsistency window where state spread across multiple documents can be inconsistent. This will make any relational programmer scream bloody murder. And yes, if you absolutely and positively need to have that state stay consistent, then you will need some application-side code to do it, and it will ruin your productivity.

But most applications don’t need global consistency, in fact an inconsistency window of a few seconds is acceptable in most situations. It is the programmers who need global consistency, because they do not have the mental tools required to work with eventual consistency. But once you get the hang of it, there is no working around, no overhead, no additional steps or checks required to make your application work. It is a different route, but not a longer one.

In addition to the above, from my experience, there are clear and significant benefits to using CouchDB over MySQL that are not related to scalability or performance. These benefits may well be useless to your specific situation, but they do exist.

1. Schema changes are painless and non-locking

This (lesson 3) is what brought me to NoSQL in the first place.

CouchDB does not implement a schema in the way an SQL product rigidly delineates tables, columns and relationships. Of course, it would be foolish to actually have no schema concept at all, so there is a dedicated schema layer in our application architecture that describes what the CouchDB “tables” look like, in terms of serialization and deserialization. Schema changes are therefore a simple change to the deserialization process, which needs to be able to read the old data format.

For simple changes, such as adding a field with a constant value, no work is required as the deserialization layer can fill in the missing field on the fly. For complex changes that involve application-provided data, such as adding a “file size” field that needs to be initialized with the actual file size, there is a clear benefit to having the application itself perform the schema change, as opposed to application-independent ALTER scripts.

2. Document contents can be dynamic

This was the actual reason we settled on CouchDB: our application lets users add their own custom fields to objects, and then filter/sort based on these fields. This requires almost no programming effort (aside, of course, from the user interface involved in doing so) and is nearly as efficient as using static programmer-provided fields.

I have had in the past some experience with managing arbitrary fields on a SQL platform, mostly when I was working with open source e-commerce platform Magento. Dynamic fields involve some significant boilerplate (such as entity-attribute-value tables) and clever tricks to perform filtering efficiently.

3. The application-database impedance is lower

A typical SQL schema contains two kinds of relationships: natural relationships such as «an article has an author» between two entities that can and will usually be queried independently, and accidental relationships such as «an article has several tags» that are only present because SQL cannot store the tags in the post table. As such, extracting a post from an SQL database counter-intuitively requires one query to grab the post itself, and another query to grab its tags.

CouchDB does away with accidental relationships completely by storing JSON documents. While this might allow a performance in some cases, the main benefit is that object composition as described by the programmer in the application code is persisted intuitively, without jumping through the intellectual hoops typical in relational storage.

4. An identifier-centric application architecture is possible

What does it mean to be identifier-centric or object-centric? A function to get the full URL of an article, in an object-centric application, is a function that takes an article object as an argument (or possibly a member function of the article object) and returns the article’s full URL. In an identifier-centric application, it would be a function that takes an article identifier as an argument (or possibly a member function of the article identifier class) and returns the full URL.

Identifier-centric architectures have major design benefits over object-centric ones, with clear consequences in terms of productivity and correctness, but have a major performance problem as the same data is read from the database several times unless some very complex caching strategies are applied — that data might be read using a quite complex SQL query that is hard to keep in cache correctly.

From my experience, the vast majority of queries in a CouchDB application will either query a document by its identifier, or query a view for several key-identifier-document pairs. In short, most of the data manipulated by the application can be easily traced back to an identifier without any specific design effort. And get-document-by-id requests are far easier to cache and optimize than arbitrary SELECT requests, both at the application level (we have a temporary cache that lasts the lifetime of the HTTP request) and with key-value caches like Memcache.

This may sound like a performance argument, but it isn’t, or at least not in the traditional «NoSQL is faster than SQL» sense. It just means that using NoSQL makes an identifier-centric architecture acceptable in terms of performance.

Article image © Satoru Kikuchi — Flickr

The Art of Development Time Estimates (Part 1)

Writing software takes time, and time is money both in terms of programmer wages and in terms of delayed releases. It makes sense to try and predict ahead of time how long a given feature would take, in order to make an informed decision about whether it should be attempted, reduced or eliminated. If your job is to predict durations, make sure you understand whether you are expected to provide a back-of-the-envelope approximation — with the implication that it could be wrong by an order of magnitude in both directions — or if you’re going for a guarantee — this feature will cost no more than X days of work, unless something really catastrophic occurs, which is what a paying customer wants to know.

If your co-workers ever start using your approximations to define milestones, prepare deadlines and discuss delays, you have not insisted enough on the fact that it was an approximated answer. If anyone asks me for an on-the-fly estimate, I provide an upper and lower bound as “this will take somewhere between 2 and 10 days”. This is an outrageously wide range, but it’s fairly correct in terms of how wrong I can be with my on-the-fly estimates, and it deters anyone from just adding up the estimates to come up with a deadline. Yes, some people have tried converting “between 2 and 10 days” to “around 5 days” but I gave them the evil eye every single time. If anyone needs to turn an approximation into a guarantee, there is no sane reason to use anything but the upper bound.

What Could Go Wrong ?

There’s a fairly common mistake to be made with the upper bounds, and I’ve made it myself quite a lot: not being pessimistic enough. We’re lazy humans, so being optimistic is natural: we come up with a few tasks that need to be done, slap a reasonable duration on each one, and add them up. The upper bound is then pulled out of a top hat as being two to three times higher than the lower bound, because that feels right. Quite to the contrary, the upper bound must be calculated by actively looking for those things that can go wrong. Newbie programmers can usually provide a fairly accurate optimistic estimate because knowing what needs to be done is a prerequisite of being a programmer at all, but the pessimistic estimate requires knowledge of what can go wrong, which by definition is an esoteric list of accidents gathered from experience rather than rational forethought:

  • The feature involves changing some code that is unusually brittle or unstable, so time will be needed to either pay the technical debt up front and bring that code back to acceptable quality levels, or soak up the cost of hunting for bugs after the code has been changed. This is the most frequent issue I encounter when dealing with changes to existing software, because not all code is of equal quality regardless of how much effort you put into it.
  • A library does exist, but preliminary analysis failed to observe that it only supports 95% of the required feature set, so additional time is necessary to obtain the missing 5%. Several internet-facing modules in one of my recent projects use a standard library for doing HTTP requests, but I discovered late during development that said library did not support HTTPS, which prompted me to include a second library, and incur technical debt related to having two overlapping libraries in the same project.
  • The library does fulfill all requirements, but happens to contain an obscure bug that prevents the feature from working as expected, so more time is spent trying to work around the bug and get the library authors to fix it. This is especially nasty when no replacement is possible, such as errors in database servers.
  • The code works as written, but QA testing reveals massive performance issues on typical user input, and time is required to correct the issue. On an older project, I used a jQuery plugin for handling five-star ratings, with a single rating component costing 300 milliseconds in initialization — nothing noticeable on our test pages where only one component was used, but it brought the page load time to an unacceptable three seconds because users created feedback polls with dozens of such components.
  • The programmer who implemented the first half of the feature is ill, on vacation, fired, fighting fires on another project, demotivated, stuck in the snow or otherwise unavailable. Another developer is brought in and needs to spend some time getting familiar with the half-completed code (and getting that uncommitted code from the unavailable developer’s laptop was, in itself, a delay).
  • The programmer who implemented the feature delivered an incomplete buggy product several days late.
  • The programmer misunderstood the requirements and implemented the wrong feature.
  • An unforeseen edge case is detected that has severe consequences on the application architecture. For instance, a given server-side process was assumed to be synchronous but is discovered to be asynchronous with latencies of several minutes on high server load. This makes the original plans for a five-second loading page obsolete, and calls for a costlier, asynchronous “we’ll start working on this and notify you when we’re done” user interface strategy instead.

The list goes on. Think of it as a shopping list you can go through when coming up with a pessimistic upper bound — start with the lower bound and add possible accidents.

Announcing a “between 2 and 10 days” range out of the blue can sound ridiculous, but it is quite less so when it’s actually backed by a list potential problems. Eight days spent working around library issues, obscure edge cases and performance problems is actually pretty normal from my own experience if these problems do come up.

Stay tuned for the next issue, where I will discuss how to work with your team and your stakeholders to lower those estimates.

Article image © Alexandre Pereira Flickr

Agile Code in OCaml

First, a quick bit of background: I’m working on RunOrg [fr], a start-up that provides communities with their own online private social networks à la facebook. The technology stack is Linux-Apache-CouchDB-OCaml, and this has some implications that I will discuss below.

Facebook has it easy in terms of user management: an user starts existing on their platform the instant they sign up, at which point they fill in their first name and last name, and these are displayed to anyone who is allowed to see any hint of the user’s existence. So, making the first name and last name mandatory are quite acceptable.

At RunOrg, we cannot do this for several reasons:

  • User profiles may be created by communities as part of the membership management toolbox: we have to rely on user A to provide data about user B, and user A usually relies on an email-only source (such as newsletter or mailing-list registrations) where no first name or last name is available.
  • A given user may be part of several independent communities, and may choose to manage their identity separately for each one: appear as John Doe in an innocuous community they trust and John Censored in a more critical community.
  • We also allow users to keep control over whether a community is allowed to publish their name on the internet (as part of the online directory, or as comments on public articles).

Our needs for advanced privacy controls involve a more complex management of both what data is available and how we display it. The good news is, it’s certainly possible to handle all of this elegantly in terms of implementation. The bad news — we didn’t plan everything ahead.

It was only a few days ago that our customer requirements sessions brought up the issue of email-only sources: community managers were frustrated by the fact that our mass import functionality required first names and last names. The problem is, in almost every single programming language out there, making a required field become optional is a very dangerous endeavor because the development team must audit the entire code base to identify which parts of the code assume that the field is required, and describe what should happen when the field is null.

In your average PHP project, try making the user name optional and I can assure you that sentences like «You have been invited by  to this event» will appear. Someone failed to audit the who-invited-you-to-events code. At least with Java or C# you will get a Null Reference Exception of some kind that will show up in the logs and give you the opportunity to hunt down the mistake.

The good news is that our implementation language OCaml, does not allow null values. Instead, optional values are handled using a different value type, known as 'a option, which changes everything. An optional value simply cannot be accessed in the same way as a non-optional value. Trying to do so anyway will cause a type error that is picked up by the compiler, so a programmer can rely on these errors to quickly identify all locations the code that assume the value to be present.

I’ll say it again: in OCaml, a field being optional or mandatory is an assumption that is build into the type of that field, so changing the assumption involves changing the type and breaks all code that does not match the new assumption. Applying breaking changes to an OCaml code base is usually as simple as following a trail of compiler errors.

So, that’s what we did. We already knew that the behavior we wanted was to construct the “display name” of users like this:

  • If either the firstname or the lastname are present, use them (if both are present, use firstname-whitespace-lastname).
  • If none were present, then the private display name (visible only to the user themselves on their profile page and in the e-mails they receive) should be their e-mail address and the public display name (visible to everyone else) should be the username part of their e-mail address (so john.doe@gmail.com is shown as john.doe@…)

First, we defined two functions that compute the private and public display names based on the first name, last name and e-mail of the user. Then, the compiler error trail led us to all locations where a change was required, where we quickly identified whether the public or private name was to be displayed and replaced the existing code with our new display name functions. In total, a full audit of 40kLOC was done in less than an hour and I have proof that any code that uses the user name now handles the case where the user name is not provided.

The Rules

When working on any OCaml project, and especially on RunOrg, I follow these few rules:

  • Any assumption must cause a compiler error when broken. Either the code determines on the spot that the assumption is true, or I use the type system to prove that another part of the code already did. This rule took a massive toll on my early productivity, and I attributed it to an inherent cost of making compiler-enforced assumptions, but the real reason was that I was still pretty new at it — the elementary assumption enforcement from my smaller projects was too crude for the needs of the richness of RunOrg functionality, and it took me six months to refactor my early approach into an elegant and streamlined strategy of encoding assumptions into types.
  • Don’t work around the compiler or cheat with semantics. The initial reaction to a system that complains about every little change you make is to try and work around it by using more generic types or storing information where it does not belong. For instance, an easy solution to the optional name conundrum would have been to store john.doe@… as the name, but doing so would have been semantically incorrect (that’s a placeholder, not a name) and would have polluted the database “name” field with things that are not names and that will be treated differently from names at some point in the future.
  • Don’t accept mediocre code or patterns. Sometimes, design choices in the interface of module A will lead to ugly code in modules B, C and D because an unforeseen usage pattern happens to apply 95% of the time and the interface of module A was not designed with that usage pattern in mind. No amount of cleanup or refactoring in modules B, C and D will solve the problem, the only solution is to go back to the design of module A and change the interface even if it means that two hundred client modules will break. Keeping my code clean, elegant and short is worth wading through two hundred modules.
  • Perform lazy payments on your technical debt. I can propagate new design changes through your entire code base in one coding session, but this doesn’t mean I should. Instead, I keep a mental todo-list of all the changes that need to be applied, and apply all of them at once, locally, whenever I have to rework a given piece of code for any reason. While it may seem that such a todo-list is hard to keep and I will inevitably forget parts of it, remember that those design changes came around in order to solve the problem of ugly or mediocre code — by noticing that the code is ugly, I am reminded of the strategies that I set up in order to clean it up.
  • But be eager with small payments. If it’s a matter of moving a few functions around or refactoring a small piece of code, I do it as soon as I am done writing or rewriting it. Cleaning up little odd bits in a mostly clean code base is extremely rewarding.
  • Discover code by trying changes out. If the assumptions are correctly laid out, then the easiest way to determine the implications of a change — whether it will work and how long it will take — is simply to try it out. Following the compiler error trail will quickly reveal how many things are impacted by the change, as well as any unforeseen massive consequences. If it turns out that the change is too impacting, I just roll back my edits.
  • Keep interface patterns to a minimum. The basic idea behind having few different interfaces implemented by many parts of the system is usually expected to be «code is easier to reuse» but I disagree. Yes, that is a frequent benefit, but certainly not the most essential. Having few different interfaces means that most of my code can be described using a small vocabulary of interface patterns, and that looking at some code immediately reveals the pattern being used there. It also means that any design changes can be expressed in term of pattern changes, and can be applied almost blindly to all locations where that pattern was used. Last but not least, by using a simple shared vocabulary for large sections of the applications, I make it easier to recognize patterns in the more chaotic sections based on how they interact with the cleaner code. It’s easier to determine that two sentences have the same meaning if they share some words.
  • Love your code. In the RunOrg code base, priority 3 is making sure the code is well-designed, clean and free of technical debt, priority 2 is adding new features, priority 1 is making sure there are no bugs, and the drop-everything-you-do-and-work-on-this priority zero is that I should never hate working on the software. Motivation is paramount to keeping the code clean, feature-rich and bug-free, and even to working on the start-up in the first place, so anything that might make me question my dedication to the project or cause me pain while working on it must and will be corrected as soon as possible, regardless of other priorities.

I’m pretty certain that all of the rules are important, but I do believe the last one is an absolute prerequisite.

Article image © Ergonomik — Flickr



1342 feed subscribers
(readers who polled a feed this week)