Tag Archive for 'Web'

Node.js is Aquarius

@Ted Dziuba : your article on Node.js being cancer has brought many angry nerds with pitchforks to your door. You do make good points, and the best opinion is not one that everyone blindly agrees with, but one that gets everyone thinking — hopefully before they speak.

Scalability

I, too, would take issue with a statement like «Node.js is scalable because it is non-blocking» though not the same issue as you took. Being non-blocking does not help with scalability at all. Scalability is about how easily your system administrator can add a new machine to your web farm to soak up a heavier load than usual, and it’s all about two things:

  • Can you run multiple copies of your software in parallel? In-application sharing of data makes this harder. Some Java servers store the entire relevant state in application memory, so scaling is impossible. PHP stores session files on the disk by default, so scaling is only possible with server affinity (the same user always gets sent to the same server). A clean server with no in-application data sharing is easily duplicated, regardless of the language.
  • Is there a shared resource with sequential access? If you run a hundred thousand web servers, but all of them have to read-write the same physical drive, then your application will be no faster than that read-write speed. If you access a database that involves heavy locking, then your application will be no faster than the locking sequence can allow.

None of these are in any way improved or even affected by non-blocking semantics.

Node.js improves performance when serving multiple concurrent requests. It makes it no easier to scale, but it helps delay the point where scaling becomes necessary.

The typical explanation of how this happens is that if serving a request uses 10ms of processing things on the server («Work») and 10ms of waiting for database requests to complete («Wait»), then the ideal web server should be able to serve two concurrent requests in 10ms each by overlapping the processing time of one request with the database wait time of another. This is a pretty nice and simple idea, which is why everyone has been doing it for ages. The main difference is how it is done.

What the traditional UNIX world did is spawn enough processes — that is the Unix answer to every problem, including having too many processes around. If your Work-time is 10ms and your Wait-time is 40ms, then by allowing up to five processes (one working while the other four wait) you are effectively recycling all the wait-time in a high concurrent load situation. This is why every CGI- or FastCGI-enabled web server in existence provides a configuration entry for the number of concurrent child processes.

Node.js does the same. With that same Wait/Work ratio of 40/10, Node.js will be serving five concurrent requests at the same time and no more, because it cannot create processing time out of thin air.
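The arithmetic behind this can be written down as a small sketch. It assumes an idealized server with a single CPU, where every request needs exactly the same Work and Wait times; the function names are mine and are only there for illustration.

```ocaml
(* Idealized model: one CPU, each request needs [work] ms of CPU time
   and [wait] ms of waiting on the database. Names are illustrative. *)

(* In-flight requests needed to keep the CPU busy all the time,
   i.e. (work + wait) / work rounded up. *)
let saturation ~work ~wait = (work + wait + work - 1) / work

(* Fraction of CPU time used when [n] requests are in flight. *)
let utilization ~work ~wait n =
  min 1.0 (float_of_int (n * work) /. float_of_int (work + wait))

let () =
  Printf.printf "Work 10 / Wait 10: saturated at %d in flight\n"
    (saturation ~work:10 ~wait:10);
  Printf.printf "Work 10 / Wait 40: saturated at %d in flight\n"
    (saturation ~work:10 ~wait:40)
```

Past the saturation point, adding more concurrent requests (processes or callbacks alike) only makes each of them wait longer, since the CPU is already fully used.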

What Node.js brings to the table is an architecture that performs, at the server level, what the traditional UNIX world did at the kernel level: scheduling. Whether this approach is significantly faster than a properly configured FastCGI setup is still a matter of debate, and I believe the answer here is simply that, as long as the Wait/Work time ratio does not push the number of concurrent processes higher than what the available memory allows, there will be no significant difference between FastCGI and Node.js in terms of blocking.

The UNIX Way

I once agreed with your stated opinion on the matter, but I got better. Here’s the thing: today, being an HTTP server is no more of a «responsibility» than reading from STDIN and writing to STDOUT. Make no mistake: being a production, internet-facing HTTP server is a responsibility, but that is not what Node.js is (or should be) trying to achieve.

Consider this: the production, internet-facing HTTP server must communicate with the actual application using one protocol or another. CGI is one such protocol, FastCGI is another, and HTTP is yet another — the fact that the same protocol is used for serving requests over the internet is not a problem, it is actually a benefit because communicating through HTTP is a solved problem with a clean API in almost every single language out there.

There is now something I would jokingly call «The REST Way» which follows in the tracks of the UNIX Way in a cloudy fashion : small applications performing one task — dispatching internet requests, constructing responses, persistent storage, caching — running on any number of servers in any number of locations, and connected to each other through HTTP requests. In an nginx-Node.js-CouchDB stack, nginx is the dispatcher, Node.js constructs responses, and CouchDB provides persistent storage, and everyone «speaks» HTTP in the same way that Unix processes «speak» STDIN/STDOUT.

Article image © Patrick Janicek — Flickr

The Plans & Pricing Page

While creating my own Plans & Pricing page, I collected screenshots of those same pages from a variety of companies. This helps illustrate both what everyone is doing, and how some of them are innovating on the matter.

backupify.com – backups for cloud data

basecamphq.com – online project management

bitbucket.org – online code hosting

ginzametrics.com – SEO analytics

github.com – online code hosting

huddle.com – online project management

raventools.com – SEO analytics

rhapsody.com – cloud music service

seomoz.org – SEO analytics

I’m Going to Miss the Internet

My first dealings with the internet went through a 56k modem. I had to find and save pages to the computer to browse them offline in order to avoid the large phone bills that came after you stayed online for too long. These days, I have five computers plugged into a single fat pipe at all times, with more bandwidth than I could ever use, at one hundredth of the former cost. But still, as the internet and the computing world improved and matured, some key aspects were lost.

Browsing the internet used to be an anonymous activity. As you came online, you were awarded an IP address, which acted as your avatar in your dealings with other computers on the network. There was no way for anyone on the internet to reliably trace any kind of online activity back to your real-life existence, because there was no link between IP addresses and human beings. Even if someone did find out that you owned a given IP address, you could still argue that it had belonged to someone else when the activity took place. Sure, a handful of countries that were known for their poor human rights track record could play Big Brother with their citizens, but I lived in a first world country that would certainly respect my right to privacy. I was wrong. Browsing the internet in France is no longer anonymous, as internet service providers are required by law to log the owners of every single IP address they allocated. There is now a link between your IP address and your name and home address, and government agencies may follow that link to hunt you down.

I used to believe that the Internet was immune to such tampering because it was decentralized, that the RIAA and MPAA were fighting a losing uphill battle, that any attempt to restrict online freedom would be voided by technical counter-measures and workarounds. This belief was epitomized by John Gilmore in his 1993 quote:

The Net interprets censorship as damage and routes around it

This warm feeling of eternal resilience relied on a single assumption : almost every single data transfer technology can be abused to transfer illegal data (the latest Lady Gaga single, child pornography, mentions of Tian’anmen Square), and the government cannot afford to outlaw all data transfer technologies. I call this the Collateral Damage Assumption — any effective solution would involve too much collateral damage to be implemented by lawmakers. But this assumption, as self-evident as it may seem in a first world country, is incorrect.

Subtle side-effects

One reason why this assumption breaks down is that lawmakers only care about flashy, obvious side-effects. They honestly believe they can get away with subtle side-effects, so they will settle on solutions that hide away the collateral damage so that taxpayers will not notice it until it is too late. I have an actual example here, so bear with me.

A few years back, copyright owners spied on peer-to-peer networks to identify the IP addresses of illegal downloaders, traced those back to the actual names and home addresses of real-life people, sued them for infringement, and failed because there was no proof that those people were actually guilty of downloading copyrighted works, as opposed to merely being the unlucky owners of a hijacked WiFi network — it takes a few minutes and a few dollars to hack into a secured WiFi network, not to mention all those open WiFi hotspots in various restaurants and institutions.

Then, the law that became known as HADOPI was introduced. Among other things, the bill made it a misdemeanor to connect to the Internet a device that is insufficiently protected against malicious users. If a copyrighted work was downloaded from your IP address without your consent, then you failed to protect your internet connection against that malicious user and you would be sentenced for the misdemeanor. Can you swear that your home network is secure? Do you regularly change the WiFi key, keep your router firmware and operating systems up to date, and monitor your traffic for any suspicious activity? Me neither, and I suspect the average Internet connection owner does not even understand what changing a WiFi key involves.

The media and several activist groups made a fuss about the fact that the sentence carries the possibility of being barred from owning an internet connection for an entire year. That’s annoying and extreme, but certainly not the main issue.

Few recognized this law for what it was: reducing the number of false negatives (letting pirates off the hook) at the cost of having more false positives (punishing helpless, innocent people). But those false positives are a subtle side-effect: the only people who notice are those directly affected by it, and those with the technical skills to understand that securing an internet connection is hard. Outside of well-informed technical circles, the general opinion on the HADOPI remains that you will only be punished if you download copyrighted works.

And there were even subtler effects. One of them was that many pirates, aware that they were at risk of being discovered, started using encrypted file sharing protocols in order to evade detection. This significantly increased the amount of encrypted data over the network, because downloading the latest episode of The Big Bang Theory uses more bandwidth than all your HTTPS browsing and SSH terminals combined. Needless to say, the NSA was less than happy about having a lot more data to sift through when looking for terrorist threats.

While on the topic of subtle collateral damage, there is yet another example, this time in an otherwise fairly reasonable decree by our government. Around these parts, laws provide a general framework, and decrees are then used to fill in the details such as what forms should be filled, how much money must be paid, or what data is covered by “should keep the relevant information for at least one year”. In this case, the decree asked for user passwords to be kept around for at least one year, going against the fundamental principle of password security which is to never store user passwords, ever. I’m fairly certain that the people who added “and passwords” to that decree had absolutely no idea that this was an insanely bad idea, and I suspect that it would take quite some time to explain exactly why it’s such a bad idea.

General Misunderstanding

In the end, we live in a world where only a small technical elite can hope to understand the consequences of such decisions — and that is when we do agree with each other. Decisions by the unsuspecting lawmakers, unopposed by the uninformed general population, can ultimately hurt the Internet in subtle but permanent ways.

This week, the Queensland police likened receiving photos to taking stolen television sets. This is a pretty good analogy, except for the fact that 1° you cannot make a copy of a stolen television by clicking a button and 2° you do not receive thousands of television sets (stolen or otherwise) on a daily basis while browsing the web.

The easiest way to explain computing concepts to normal people is to use analogies, and all analogies are inherently flawed. Hilarity ensues when the analogy is taken to its logical but incorrect conclusion.

To make sane decisions, instate sane laws and pass sane judgements on the computing world, working by analogy is the last thing you want to do. Copyright infringement is not theft. Privacy invasion is not theft. The only acceptable way of dealing with the complex technical concepts around us is to determine their consequences in the real world, and decide based on those consequences.

What are the real-world consequences of a journalist receiving unauthorized Facebook pictures when writing an article about the security issues that allowed the pictures to be obtained in the first place? Are any of these consequences worth arresting the journalist and confiscating his property?

Being Left Behind

There’s another reason why the Collateral Damage Assumption is incorrect. We say to the computer manufacturers “let us install any software we want on our computers, or you will kill the economy” and thus we retain the right to install any software. Can you imagine the next version of Windows refusing to install any kind of peer-to-peer software? That would require some heavy restrictions on installing new software, so no one would buy it.

There was no collateral damage to Apple deciding that all applications on the iPhone must be accepted by the App Store first. They defined a new market and set their own rules, and most people accepted this situation without flinching.

We praised the Internet, and the computing world, for their versatility, for their ability to evolve around any obstacles in their path. But we assumed that this meant those features we held so dear would remain forever. This is completely wrong : the world will move away from any features that do not fit in anymore. I assumed that I would forever be able to participate anonymously on various online communities, but they are starting to use Facebook Comments because there is now a critical mass of people who 1° use Facebook and 2° don’t care about writing things in their own name on the Internet. The “mainstream Internet” has already given up on many earlier features I took for granted :

  • Browsing without cookies or JavaScript. Now, sites require these even if you do not have an account.
  • Interacting anonymously or with pseudonyms. Now, you need to use Facebook.
  • Dealing with many small tools and communities. Now, there are a handful of huge “cloud” conglomerates and communities.
  • Content placed online by competent experts. These days, anyone can create a blog to share they’re [sic] mistakes with everyone else.

As with anything that evolves, nothing is forever, not even those things that we thought the Internet could never exist without.

The Internet isn’t dying. It’s becoming something else that I’m not entirely happy with.

Missing the Point of Facebook Connect

Facebook Connect lets Facebook users connect to third party web sites using their Facebook account, and provides the third party site with limited access to the private information of the visitor. Many sites implement it in order to increase the conversion rate from anonymous users to connected users. The main issue is how greedy those sites are going to be with private information access — and how disproportionate the access requests are in some cases.

How much access I’m willing to grant a given site obviously depends on how I intend to use it.

In this aspect, someecards.com is unusually silly. They’re an online greeting card site that is surprisingly lightweight in terms of ads, and they mostly rely on you sending online greeting cards to your friends to bring them to the site. The feature they implement using Facebook Connect is posting a greeting card directly to a friend’s wall, something that obviously needs me to authenticate as the owner of my Facebook account and allow them to post to my friend’s wall. What someecards.com actually requires me to do is:

  • Grant them access to my public profile information : by authenticating, they know who I am and can subsequently access my public profile. Because it’s, you know, public.
  • Let them send me e-mail directly at my own address. Why?
  • Let them access my birth date. Why?
  • Let them access the birth dates of my friends. Why?

Out of four access requests, three of them have no relationship whatsoever to what I am actually trying to do — posting a greeting-card to the wall of a friend. This makes Facebook Connect look less like a helpful feature and more like a troll guarding a bridge. Whatever you do, do not use Facebook Connect like this!

Of course, why they are asking for this is quite obvious: they are actually having me create an account with them behind the scenes, and that information is needed to power their birthday calendar feature, which I suspect involves sending me e-mail to remind me that I should send a someecards.com greeting card to a friend for their upcoming birthday. And, again, they are trying to funnel users into creating an account when all they wanted was to post something to a friend’s wall. These are two completely distinct use cases. Keep your act straight, please.

Naturally, what makes me say that the guys behind someecards.com are scumbags is a subtle but very interesting touch. But first, a bit of trivia about Facebook e-mailing permissions: when a third party website asks for your e-mail address, Facebook lets you choose between your actual address and a proxy address (mail sent to the proxy is forwarded by Facebook to your actual address). The entire point of using a proxy address is that, as soon as you revoke e-mail rights to that web site, the proxy address disappears and they can’t send you any more mail. If you do provide your actual address, then it remains in their database for as long as they wish to keep it there, regardless of what rights you revoke or how much you complain. As a matter of principle, I always pick the proxy address. Most websites just deal with it.

The someecards.com developers actually added a piece of code that detects if you have picked a proxy address and refuses to create your account.

It is quite subtle and requires some knowledge about how proxy addresses work, and the error message tries to pass it off as a «security requirement» so I suspect few people understand exactly what is going on when this happens.

What they just said was: «We’re going to send you e-mail even if you don’t want to receive it anymore, and we don’t want you to be able to stop us!»

Way to go, guys. Way to go.

Objective Caml Web Programming

The core RunOrg¹ application clocks in at about 30K lines of Objective Caml code, with around 2K being added every week. If you factor in our use of CouchDB, all of this might strike you as an odd choice of technologies, based on esoteric hopeful fantasies instead of cold pragmatic consideration. It isn’t, despite what others might say:

OCaml: You know yourself to be fast, smart, and extremely reliable. However, you look kind of funny and nobody really wants to talk to you. You spend most of your time sitting in a public library glaring at people, occasionally yelling “NOBODY HERE APPRECIATES MY GENIUS!” and getting kicked out.

Two years ago, I discussed the topic of using Objective Caml for web programming:

What would happen if a compact web framework were proposed? One that, in addition to borrowing existing useful concepts from other languages, also added some OCaml-specific features to the mix. Functional modules would be an interesting addition, so would be the type system and pure functional programming applied to transactions, and monadic optimization at initialization time would also be quite interesting.

Eliom

Let’s get this out of the way first. I have been continuously peeking at Ocsigen – Eliom (a web server and assorted web framework) ever since it was mentioned in a comment, and some aspects of it resonated with me while others really did not. In many ways, it served as a showcase of the many ways in which the peculiarities of Objective Caml can impact the development of a web project, and helped me decide whether these were appropriate or not. This evolved into my own rendition of a web framework, Ozone, connected to an Apache server through OcamlNet2-powered FastCGI.

There were many reasons for avoiding Ocsigen – Eliom, though I do not believe any of them to be universally true. The main reason was described in Guillaume Yziquel’s comment on that article:

Somehow, even a Ruby on Rails app is a state machine. Perhaps a “better state machine”, but a state machine nonetheless, in the sense that incoming requests interact with each other by modifying the internal data.

With Ocsigen / Eliom, it’s completely different: it’s a “safely” multithreaded, compiled, application. And that makes all the difference.

Based on my experience with Ocsigen – Eliom, I fully agree with this assertion, but consider it a liability in my situation. Our business plans call for a number of users that cannot be safely expected to all run out of a single server, be it multi-threaded, for both scaling and redundancy reasons. At some point, the only communication bridge between two requests will be the database back-end, and I need my web framework to accept that and actually make sure that my one-server code will gracefully scale up to a multi-server setup.

On a more philosophical level, I agree that «On [the] server side, somehow, the “state machine” paradigm has been a hindrance», but HTTP being what it is, this is a basic truth that will not go away. Eliom is building an abstraction on top of it that will continuously spring leaks whenever the disconnected nature of HTTP surfaces. This is what ASP.NET and countless other technologies tried to do, and they have all made the fall back to HTTP harder when the situation did eventually ask for it.

Ozone is also a compiled application, but it has one thread and no sessions — scaling happens by launching more instances of the application, which transparently supports the addition or removal of servers, while “session data” is stored in a combination of client-side state, database storage and HMAC proof tokens in the URLs. While this ascetic approach cuts me off from the sheer sexy of what Eliom allows, the tradeoff is a fairly convenient set of scalability guarantees. But if you can afford all that Eliom sexy, then I have no issue with that.
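To give an idea of what an HMAC proof token in a URL can look like, here is a rough sketch. It is not Ozone's actual code: the function names, the URL layout and the secret are made up for this example, and HMAC-MD5 is used only because MD5 is what the standard library's Digest module provides.

```ocaml
(* HMAC-MD5 built on the standard library's Digest module (MD5).
   Illustrative only; a production system would pick a stronger hash. *)
let hmac_md5 ~key msg =
  let block = 64 in
  (* Keys longer than one block are hashed first, as HMAC specifies. *)
  let key = if String.length key > block then Digest.string key else key in
  let pad byte =
    String.init block (fun i ->
      let k = if i < String.length key then Char.code key.[i] else 0 in
      Char.chr (k lxor byte))
  in
  let inner = Digest.string (pad 0x36 ^ msg) in
  Digest.to_hex (Digest.string (pad 0x5c ^ inner))

(* Hypothetical secret, shared by every instance of the application. *)
let secret = "change-me"

(* Append a proof that a server holding the secret issued this URL
   for this user. *)
let sign_url ~user path =
  Printf.sprintf "%s?user=%s&proof=%s" path user
    (hmac_md5 ~key:secret (user ^ "\n" ^ path))

(* Any instance can check the proof without looking up a session. *)
let check_url ~user ~proof path =
  String.equal proof (hmac_md5 ~key:secret (user ^ "\n" ^ path))
```

Because the check only needs the shared secret, a request can land on any server and still be validated, which is the property that makes adding or removing instances painless.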

Benefits of OCaml

This is why I use Objective Caml, in no particular order.

  • It’s fast out of the box — OCaml is on par with C performance as long as you don’t stray too far into sub-optimal areas (such as naive string concatenation). I can write any kind of code and be assured that it will not be the bottleneck, because database access and HTTP are a lot slower: right now, the average HTTP request takes about 80ms, with about 60ms for the actual HTTP transfer, 18ms for database latency, and 2ms for all of the Apache-FastCGI-Ozone sequence when compiled without optimizations.
  • It’s a compiled application. This one is mostly aimed at my PHP friends, where every request starts a new PHP execution from scratch — this makes several designs impossible or impractical, such as event-based programming: this would require B to register as a listener to A’s event, which means B should be identified as a potential listener and loaded for every request even if it does not trigger the event. Once initialized, a given Ozone instance can respond to tens of thousands of requests, which makes it worthwhile to run a lot of pre-processing and pre-caching operations during initialization.
  • It’s safe. I use a programming style that relies on avoiding exceptions, never using wildcards, defining many new types for almost everything, and writing pure functional code. This eliminates entire realms of bugs : using the wrong variable, forgetting to call a function or catch an exception, being surprised by a sneaky side-effect or doing things in the wrong order… About half the bugs I caught using Unit Tests don’t exist in OCaml (null reference exceptions, anyone?) and the other half is eliminated by my programming style — so I don’t write unit tests anymore (well, I do write an automated “test” every time I find a bug, but it’s usually as simple as adding a type annotation). This also lets me routinely refactor literally half the application every other week, without causing any bugs.
  • It’s concise. Most of the features I write are a matter of a mere hundred lines — most of the code is related to my obsessive need for being explicit. Being a functional language, you can define a brand new anonymous function on the spot and throw it into another function that is returned by yet another function which is then given to yet yet another function, all of it being implicitly type-checked without having to define a single IAcceptsBoxObserver interface or LeafBoxObserver implementation.
  • It has a fast compiler. Building those 30KLOC from scratch takes less than a minute — the average incremental build takes one or two seconds. Whenever I have any doubts about what I’m writing, I can just ask the compiler — Hey, did I forget anything about this function call? Why yes, master, you forgot to check that the user was indeed allowed to reply to that message.
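As a small illustration of the "many new types, no wildcards" style mentioned in the list above, consider the sketch below. The types and functions are invented for this example and do not come from RunOrg.

```ocaml
(* Distinct types for values that are all "just strings" underneath.
   Every name here is made up for the example. *)
type user_id = User of string
type message_id = Message of string

type access = Owner | Member | Guest

(* No wildcard case: adding a constructor to [access] makes the compiler
   report this match as non-exhaustive until the new case is handled. *)
let can_reply = function
  | Owner -> true
  | Member -> true
  | Guest -> false

(* Swapping the first two arguments is a type error, not a runtime bug. *)
let describe (User u) (Message m) access =
  if can_reply access then Printf.sprintf "%s may reply to %s" u m
  else Printf.sprintf "%s may only read %s" u m

let () = print_endline (describe (User "ann") (Message "m1") Guest)
```

Writing `describe (Message "m1") (User "ann") Guest` by mistake is rejected at compile time, which is the kind of "using the wrong variable" bug that never reaches a test.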

The most essential feature is complete compile-time safety. As a web programmer, I have to be careful about hundreds of small details — can this text be translated into another language? Is this user allowed to do what they just did? Did that object disappear from the database while you were editing it? Does that URL really correspond to an actual page? Did you remember to check for script injection in that piece of HTML? Is this GET parameter available at this point in the code? Is this object available or locked by another user? Did I forget anything else? It’s impossible for a human brain to think about all these things while at the same time creating an elegant design or refactoring a piece of code or writing a new feature. I can use the flexible OCaml type system to check for all these details through appropriate design of the Ozone API, which turns the development process into a game of 1° write the simplest code that works, 2° listen to the compiler’s suggestions for making it fail-proof. It’s a game that I’m becoming fairly fond of, and it lets me concentrate on the very core of what I’m trying to do.
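One way to let the compiler answer a question like "did you remember to check for script injection?" is to hide the representation of HTML behind an abstract type. The module below is a minimal sketch of the idea with a made-up API; it is not the Ozone API.

```ocaml
module Html : sig
  type t                           (* abstract: a raw string is not an Html.t *)
  val text : string -> t           (* escapes its argument *)
  val tag : string -> t list -> t  (* the tag name is trusted, children are not *)
  val to_string : t -> string
end = struct
  type t = string

  let escape s =
    let b = Buffer.create (String.length s) in
    String.iter (function
      | '<' -> Buffer.add_string b "&lt;"
      | '>' -> Buffer.add_string b "&gt;"
      | '&' -> Buffer.add_string b "&amp;"
      | '"' -> Buffer.add_string b "&quot;"
      | c -> Buffer.add_char b c) s;
    Buffer.contents b

  let text = escape

  let tag name children =
    Printf.sprintf "<%s>%s</%s>" name (String.concat "" children) name

  let to_string html = html
end

let () =
  print_endline Html.(to_string (tag "p" [text "<script>alert(1)</script>"]))
```

Outside the module, the only way to turn user input into an `Html.t` is `Html.text`, so forgetting to escape becomes a type error instead of a vulnerability.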

Disadvantages of OCaml

It’s not a happy fun place. Quite the contrary: the language comes with a set of annoying quirks and flaws that do make things harder. Before you jump in, you should know what to expect.

  • Type-safety has a price. If the type system cannot express a certain thing, then you can’t do it. There are a few fairly complex examples where this has caused me trouble, in areas such as optional function arguments, module meta-programming, JSON serialization or dynamic database-driven data structures. Workarounds exist, but they’re only workarounds. Another side-effect is that type inference can make it hard for inexperienced developers to find an error, especially if you do a lot of strange type wizardry. Not to mention the silly yet annoying “this expression has type foo but is used here with type foo” error.
  • Lack of tools and libraries. Being a non-mainstream language means there are no heavily tweaked and highly evolved tools available (think about the wealth of tools available for C# or Java development), which gives a certain clunky feel to development. Besides, many libraries which are taken for granted in the mainstream world are missing or undocumented — try connecting to the Facebook API and you’ll notice that not only is there no Facebook SDK in OCaml, but there is also no documented way of using HTTPS. The same goes for Amazon S3 and MD5-based HMACs, by the way. And iconv functionality. And removing the X-Mailer header from e-mail you send. The list goes on.
  • It’s not object-oriented. You can use classes and mutable objects — it’s a viable implementation strategy, but it also bears a lot of the typical issues encountered in the mainstream programming languages, and it lacks the conciseness of functional approaches (defining a class and instantiating an object is bound to be longer than a lambda). If you’re not in the right mindset for using the language, you will miss out on a lot of the benefits.
  • It’s not popular. It is a disadvantage, just not a technical one. As a programmer I couldn’t care less about the popularity of my language because, you know, COBOL was very popular once. As a hiring manager, I am aware that using a non-popular language will make hiring developers harder. As a start-up founder, I know that this reduces my chances of selling my company because esoteric technologies are a risk to potential buyers.

There are also many tiny quirks in the language that I hope will eventually be solved. For instance, there’s the absence of a shorthand notation for the ubiquitous (fun x -> x # member). There’s also the lack of C#-like properties, with a pure functional twist:

val x = init

method get_x    = x
method set_x x' = {< x = x' >}

And, of course, there are a lot of things going on with the option type that BatOption just isn’t up to expressing concisely. The P4 preprocessor could be applied to these situations fairly reasonably, but I would feel more comfortable if they were built into the language (and syntax highlighting tools).

In conclusion, OCaml + CouchDB provide our team with the flexibility required to build new features frequently without being afraid of subtle bugs or regressions, and to regularly refactor our code into a more amenable mess. It is a level of compiler-provided safety, surgical refactoring and bug detection that would be simply unavailable with C# and Java (and hopeless with PHP, Python or Ruby).

¹ RunOrg is my Start-Up ; we provide an online tool that helps associations, unions, organizations and communities manage their members, contacts, activities, events, knowledge and online presence.

Don’t Push – A Small Review of Cache Strategies

The standard behavior of most cache systems follows these steps:

  • Attempt to read needed data from the cache
  • If data is missing, compute it and place it in the cache
  • Return the data

This is a fairly streamlined process that’s easy to add to almost any single algorithm that constructs data. The cache could be local (it’s part of the application, or even of the current function call), it could be dedicated (memcached), or the data might be persisted back to the database (such as adding the number of files in a given folder to the folder object itself instead of counting them every time).

The root of this strategy is the principle of memoization: if a function is pure — that is, calling it with the same arguments twice will return the same result twice — then you can place such a cache in front of that function so that it will only be called once for every argument.
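In OCaml, a generic memoization wrapper fits in a few lines. The sketch below follows the three cache steps listed earlier; it uses a plain in-memory hash table, and the `memoize`, `calls` and `square` names are illustrative rather than taken from RunOrg.

```ocaml
(* Wrap a pure function [f] with a cache keyed on its argument. *)
let memoize f =
  let cache = Hashtbl.create 16 in
  fun x ->
    match Hashtbl.find_opt cache x with
    | Some y -> y                  (* 1. found in the cache *)
    | None ->
      let y = f x in               (* 2. missing: compute it and store it *)
      Hashtbl.add cache x y;
      y                            (* 3. return the data *)

(* Usage: the counter shows that the wrapped function runs only once. *)
let calls = ref 0
let square = memoize (fun n -> incr calls; n * n)

let () =
  Printf.printf "%d %d (computed %d time(s))\n" (square 12) (square 12) !calls
```

The cache lives for as long as the wrapped function does, so this only makes sense when the function really is pure or when the wrapper is created and dropped within a scope where the data cannot change.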

Memoization obviously found its way into RunOrg, because it’s literally a one-word optimization hint that trades memory for performance where it matters. In practice, in a web application like RunOrg, the only really costly computation is sending requests to the database, which is by definition not pure. Still, I can usually expect that for the duration of a single HTTP request, the database contents will remain reasonably stable, so I can create a temporary memoization cache when the request starts and drop it when the response is sent. Actually, I’m using a slight variation on standard memoization which is batch memoization: in order to improve performance, queries for objects A, B, C and D are represented as a single batch query to the database asking for the list of four objects. With standard memoization, if I asked for objects A, B and C, then for objects B, C and D, then six objects in total would be requested (because those two lists are different). By extending the memoization algorithm to know that list elements are independent, I can have the second query ask for only D, and retrieve values for B and C based on the first request.
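Batch memoization can be sketched along the same lines. Here `fetch` stands in for the single batch query sent to the database, and `fake_db` is a stub that records what was actually requested; both are assumptions made for the example, not the RunOrg implementation.

```ocaml
(* [fetch ids] performs one batch query and returns one (id, value) pair
   per requested id. *)
let batch_memoize fetch =
  let cache = Hashtbl.create 16 in
  fun ids ->
    (* Only ids that no earlier batch has retrieved go to the database. *)
    let missing =
      List.sort_uniq compare
        (List.filter (fun id -> not (Hashtbl.mem cache id)) ids)
    in
    if missing <> [] then
      List.iter (fun (id, v) -> Hashtbl.replace cache id v) (fetch missing);
    (* Answer from the cache, in the order the caller asked for. *)
    List.filter_map (fun id -> Hashtbl.find_opt cache id) ids

(* Stub database that remembers every id it was asked for. *)
let requested = ref []
let fake_db ids =
  requested := !requested @ ids;
  List.map (fun id -> (id, "object " ^ id)) ids

let get = batch_memoize fake_db
let first = get ["A"; "B"; "C"]
let second = get ["B"; "C"; "D"]
```

After these two calls, `requested` holds A, B, C and D once each: the second batch only asked the database for D, and B and C were served from the results of the first.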

Outside of this situation, however, functions can hardly be considered pure, so special steps must be taken to keep the cache up to date. This results in three common strategies.

The first one, which is the cache expiration strategy, is to give up on data freshness and decide that data in the cache will survive for a fixed duration regardless of whether the actual data has changed or not — so, instead of declaring that data is always up to date, it declares that any change will be visible everywhere in less than X seconds. While somewhat weak, this strategy is particularly effective because it does not need any kind of knowledge of the relationship between the cache and the underlying persistent data — the only connection between the two is the compute-if-missing steps outlined above.
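A rough sketch of the expiration strategy, assuming an in-process dictionary. `ExpiringCache` is an illustrative name, and the `clock` parameter exists only so that the behavior can be demonstrated without actually waiting.

```python
import time
from typing import Callable, Dict, Hashable, Tuple


class ExpiringCache:
    """Compute-if-missing cache whose entries live for a fixed duration."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        # key -> (expiration time, value)
        self._items: Dict[Hashable, Tuple[float, object]] = {}

    def get(self, key: Hashable, compute: Callable[[], object]) -> object:
        now = self._clock()
        entry = self._items.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]            # still fresh enough: serve as-is
        value = compute()              # missing or expired: recompute
        self._items[key] = (now + self._ttl, value)
        return value


fake_now = [0.0]
source = {"v": 1}
cache = ExpiringCache(ttl_seconds=10, clock=lambda: fake_now[0])
```

Note that the cache never looks at the underlying data: a change to `source` only becomes visible once the ten seconds have elapsed.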

Once you decide to handle the relationship seriously, two more strategies become available. The cache invalidation strategy is activated when the underlying data is changed, and invalidates all cache items that are dependent on that data. Thus, subsequent requests for those items will trigger a cache re-computation and always serve fresh data. Of course, this assumes that the cache system can tell which items should be invalidated. This is fairly easy in a one-to-one data-to-cache mapping, but as pieces of data can be mixed and matched into various cache items, this requires an unusually complex architecture to handle.
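To make the dependency problem concrete, here is a small sketch in which every cached item declares which pieces of data it was built from. `InvalidatingCache` and the `"profile:1"` style keys are assumptions made for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Set


class InvalidatingCache:
    """Cache that remembers which source data each cached item depends on."""

    def __init__(self) -> None:
        self._items: Dict[str, object] = {}
        self._dependents: Dict[str, Set[str]] = defaultdict(set)

    def get(self, key: str, depends_on: Iterable[str],
            compute: Callable[[], object]) -> object:
        if key in self._items:
            return self._items[key]
        value = compute()
        self._items[key] = value
        for source in depends_on:
            self._dependents[source].add(key)
        return value

    def invalidate(self, source: str) -> None:
        """Call when `source` changes: drop every item computed from it."""
        for key in self._dependents.pop(source, set()):
            self._items.pop(key, None)


db = {"profile:1": "Ann", "event:X": "Gala"}


def build_card() -> str:
    return f"{db['profile:1']} @ {db['event:X']}"


cards = InvalidatingCache()
deps = ["profile:1", "event:X"]
```

The one-to-many bookkeeping in `_dependents` is the part that grows complicated once items mix data from many sources.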

A nice design trick to keep in mind is that you don’t always need to find all that data; sometimes, you can simply «lose» it. For instance, web caching uses a combination of expiration and invalidation strategies. When a file is sent to a browser through HTTP, it sometimes carries a header explaining when it expires. This is useful when all the pages on your web site use the same CSS or JavaScript files, because then your visitors will only need to download them once and will use the cached versions from then on. To handle changes in CSS and JavaScript files, some web sites rely on cache expiration (a few minutes or hours of cache lifetime, so that any changes are detected soon enough) while others use cache invalidation. Obviously, the web site can’t go and notify every single browser that the CSS files have changed (especially since some browsers are closed or offline) so it will simply lose the data: while the CSS file named style.css?0001 will remain in cache for up to a month, the pages on the site are now asking for style.css?0002.
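The style.css?0001 to style.css?0002 switch could be generated by a helper along these lines. `ASSET_VERSIONS`, `versioned_url` and `asset_headers` are hypothetical names; the month-long lifetime is expressed with the standard `Cache-Control: max-age` directive.

```python
from typing import Dict

# Hypothetical version counters, bumped by hand or by a deployment script
# whenever the corresponding file changes.
ASSET_VERSIONS: Dict[str, int] = {"style.css": 1}

ONE_MONTH_SECONDS = 30 * 24 * 60 * 60


def versioned_url(path: str) -> str:
    """URL that pages should link to; it changes whenever the version does."""
    return f"{path}?{ASSET_VERSIONS[path]:04d}"


def asset_headers() -> Dict[str, str]:
    """Headers sent with the file: browsers may keep it for up to a month."""
    return {"Cache-Control": f"max-age={ONE_MONTH_SECONDS}"}


before = versioned_url("style.css")
ASSET_VERSIONS["style.css"] += 1   # the CSS file was edited
after = versioned_url("style.css")
```

Nothing is ever removed from the browser caches: the old URL simply stops being requested.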

The third, the cache refresh strategy, is a variant on the former: instead of merely invalidating the cached data, this strategy computes the new data and places it directly into the cache. This is necessary when the data is frequently accessed and the computation is long: if one hundred users come asking for the data while it’s invalidated, then all of them will compute the data as part of the “if missing, compute” step of the caching process, which will probably bring the server to its knees — what people call a stampede — so the only safe thing to do here is to keep data in the cache at all times and replace it with a more recent value whenever necessary.

Another nice design trick is to use a flexible expiration date to turn a cache invalidation strategy into a cache refresh strategy: instead of invalidating the item by removing it from the cache, merely set its expiration date somewhere in the past. Then, to avoid stampedes, whenever a user detects that the cache is expired, they set the expiration date to sometime in the future and start computing the data. So, the first reader will notice a delay (as their request will reconstruct the cache), while subsequent readers will instantly read the old cached version without triggering a stampede, until the first request ends and the cache contains the new version of the item. To choose between the two: if the event that triggers the refresh provides you with data that reduces the time required to update the cache, then update it, otherwise merely invalidate it and rely on your stampede protection to do the job.
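Here is a single-threaded sketch of that flexible expiration trick. `SoftExpiryCache` and its parameters are illustrative. In a real multi-process setup, noticing the expiration and pushing it into the future would have to be a single atomic operation, which this sketch does not attempt.

```python
import time
from typing import Callable, Dict, Hashable, List


class SoftExpiryCache:
    """Invalidation keeps the old value around so only one reader rebuilds it."""

    def __init__(self, ttl: float, rebuild_grace: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl
        self._grace = rebuild_grace
        self._clock = clock
        self._items: Dict[Hashable, List] = {}   # key -> [expires_at, value]

    def invalidate(self, key: Hashable) -> None:
        entry = self._items.get(key)
        if entry is not None:
            entry[0] = float("-inf")   # expired, but the old value is kept

    def get(self, key: Hashable, compute: Callable[[], object]) -> object:
        now = self._clock()
        entry = self._items.get(key)
        if entry is None:
            # Cold cache: there is no old value to serve in the meantime.
            value = compute()
            self._items[key] = [now + self._ttl, value]
            return value
        if entry[0] <= now:
            # First reader to notice: push the expiration into the future so
            # that other readers keep getting the old value while we rebuild.
            entry[0] = now + self._grace
            value = compute()
            self._items[key] = [self._clock() + self._ttl, value]
            return value
        return entry[1]


fake_now = [0.0]
source = {"v": "old"}
soft = SoftExpiryCache(ttl=60, rebuild_grace=5, clock=lambda: fake_now[0])
seen_during_rebuild: List[object] = []


def slow_compute() -> object:
    # Simulates a second reader arriving while the first one is rebuilding.
    seen_during_rebuild.append(soft.get("k", lambda: "stampede!"))
    return source["v"]
```

The reader that arrives mid-rebuild gets the old value immediately instead of starting its own computation.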

There’s one last situation that makes caching complicated, which I’ve recently had to handle with the RunOrg application: indexing. Suppose you have a huge amount of data that you need to wade through, nicely split into separate objects that each have their own responsibility (my profile, my membership information, my participation in event X, my answers to poll Y…) but you sometimes need to virtually aggregate all that data and traverse it to retrieve only certain parts of it: give me the name, premium-or-not-premium membership status and answer to “T-Shirt Size” in poll Y, sorted by the registration date for event X. Yes, that’s one of the many queries that RunOrg lets you do (and you can even print out that data to serve as a list of participants). Now, trust me, there’s no sane way to dynamically run queries of the sort on a clean normalized database and still get reasonable performance. So, you need to create an alternate, denormalized representation of all that data and keep it in cache to avoid re-computing it for every request.

The problem with such a cache is that you cannot afford to re-construct a thousand lines of cached data because one user changed their T-Shirt Size answer, so there can be no high-level validity check. Basically, if you can’t trust the cache to be up to date when you run your read query, you lose. The traditional “try to read, if invalid update” approach to caching goes straight out the window. You need a solid cache refresh implementation that pushes the most recent data into the cache as soon as it becomes available.

RunOrg uses a variant — the cache pull strategy. This is merely a small semantic shift, but it’s quite helpful: in the original cache refresh situation, the data model needs to be aware of the cache, because it must actively send data to the cache whenever a change happens. With the RunOrg variant, the data model merely publishes a “data was updated” event that the cache module may listen to and react by refreshing its contents. So, the knowledge of how to extract data from the model and place it in the cache now belongs to the cache module instead of being spread over both the data model and the cache module. This not only makes the code cleaner — the data model becomes cache-independent and thus easier to read through, with cache modules being tacked onto it using the event system — but it also lets the cache react to events from different data models: an item might be updated when the profile changes, or the membership information changes, or the event X participation changes, or the answer to poll Y changes… and will have to read data from all four to compute the new value anyway.

Obviously, the cache pull strategy is a more complex architecture than the previous one:

  • You need an event system — the entire contraption hinges on the fact that a cache module can listen to the changes that happen in a data module.
  • Your cache module must track the dependencies of each item, in order to update that item when it receives a change event for one of its dependencies.
  • You need asynchronous processing, as pulling values for dozens of items simply cannot be done as part of the standard HTTP response cycle.
  • You need to follow clean multi-process patterns to handle simultaneous updates of some items.

Still, given the performance we achieve with this approach, and the clean code that results from the underlying events-and-async structure, the results are certainly worth the effort.
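For readers who want to see the shape of the cache pull strategy, here is a deliberately tiny sketch. It is not the RunOrg implementation: `EventBus`, `ProfileModel`, `PollModel` and `ParticipantIndex` are invented for the example, a queue drained by `process_pending` stands in for real asynchronous processing, and multi-process concerns are ignored.

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, List, Optional, Tuple


class EventBus:
    def __init__(self) -> None:
        self._listeners: Dict[str, List[Callable[[int], None]]] = defaultdict(list)

    def subscribe(self, topic: str, listener: Callable[[int], None]) -> None:
        self._listeners[topic].append(listener)

    def publish(self, topic: str, user_id: int) -> None:
        for listener in self._listeners[topic]:
            listener(user_id)


class ProfileModel:
    """Data module: publishes change events and knows nothing about caches."""

    def __init__(self, bus: EventBus) -> None:
        self._bus = bus
        self.names: Dict[int, str] = {}

    def rename(self, user_id: int, name: str) -> None:
        self.names[user_id] = name
        self._bus.publish("profile.updated", user_id)


class PollModel:
    def __init__(self, bus: EventBus) -> None:
        self._bus = bus
        self.answers: Dict[int, str] = {}

    def answer(self, user_id: int, value: str) -> None:
        self.answers[user_id] = value
        self._bus.publish("poll.updated", user_id)


class ParticipantIndex:
    """Cache module: listens to both models and pulls fresh data from them."""

    def __init__(self, bus: EventBus, profiles: ProfileModel, polls: PollModel):
        self._profiles = profiles
        self._polls = polls
        self.rows: Dict[int, Tuple[Optional[str], Optional[str]]] = {}
        self._dirty: Deque[int] = deque()
        bus.subscribe("profile.updated", self._mark_dirty)
        bus.subscribe("poll.updated", self._mark_dirty)

    def _mark_dirty(self, user_id: int) -> None:
        # Cheap enough to run inside the HTTP request that made the change.
        if user_id not in self._dirty:
            self._dirty.append(user_id)

    def process_pending(self) -> None:
        # Meant to run outside the request cycle, e.g. in a background worker.
        while self._dirty:
            user_id = self._dirty.popleft()
            self.rows[user_id] = (self._profiles.names.get(user_id),
                                  self._polls.answers.get(user_id))


bus = EventBus()
profiles = ProfileModel(bus)
polls = PollModel(bus)
index = ParticipantIndex(bus, profiles, polls)
```

The two models never mention the index; all knowledge of how to build a row lives in `ParticipantIndex`, which is the semantic shift described above.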

RunOrg is my start-up; we provide an online tool that helps associations, unions, organizations and communities manage their members, contacts, activities, events, knowledge and online presence.

Amazon-WikiLeaks sets a Scary Precedent

For those of you living under a rock and not having an internet connection down there, here’s the story so far:

  • WikiLeaks, originally hosted in Sweden, announces that it will publish several hundred thousand U.S. classified documents.
  • A hacker runs a denial of service attack on WikiLeaks, bringing them down.
  • WikiLeaks uploads some of their data to Amazon’s S3 file hosting service, and goes live.
  • Amazon pulls the plug on the WikiLeaks hosting within 48 hours.

I will not under any circumstances condemn or condone what either WikiLeaks or Amazon did there. That topic is too complex for me (and, I suspect, most people) to form an adequately justified opinion, and my biased unjustified opinions are best kept off the Internet.

On the other hand, what Amazon did was terrifying. After toiling for years to convince the general business public that moving to the Cloud does not imply accidental data loss or vicious hackers accessing your secrets, Amazon have reminded us of a basic, uncomfortable truth: they who handle your data can kill you on a whim.

«But WikiLeaks is not dead!»

I know. Keep in mind that the WikiLeaks team has backup copies with strong encryption stored by a multitude of anonymous individuals, access to international hosting in a variety of safe havens, a dedicated team of sysadmins on call to move around the site and the data whenever something dies, and a willingness to fight for the availability of that information even if it entails going to jail. The reliability of their data storage exceeds that of almost any other entity on the planet, including Amazon S3. To them, having their hosting shut down is a minor inconvenience. To a normal business that has moved its data to the Cloud, having all the bills, orders, paychecks, contracts and documents for the last year lost is an unmistakable death sentence.

How can Amazon S3 do this? Here’s the relevant part of the Amazon Web Services customer agreement:

3.4. Termination or Suspension by Us for Cause. We may suspend your right and license to use any individual Service or any set of Services, or terminate this Agreement in its entirety (and, accordingly, your right to use all Services), for cause effective as set forth below:

3.4.1. Immediately upon our notice to you in accordance with the notice provisions set forth in Section 15 below if:

[...]

(vii) we receive notice or we otherwise determine, in our sole discretion, that you may be using AWS Services for any illegal purpose or in a way that violates the law or violates, infringes, or misappropriates the rights of any third party;

This grants Amazon the right to terminate your service by snapping their fingers (and sending you an email) if there’s any hint of you doing something that might be construed as illegal.

«You’re another guy who stumbled upon a piece of legalese in a customer agreement, misunderstood it, and tells everyone how evil that corporation actually is…»

No, I’m not. I knew this termination clause had to be in there before I even looked, because it’s a fairly standard one and even my own business has it. Amazon needs this part to be able to eliminate child pornography or copyrighted books/songs/movies stored on its servers without waiting for a judge to determine that the content is actually illegal. There’s nothing evil about having that clause, and the reason we accept this situation is that we expect Amazon (and any other service) to use this power responsibly: as long as you don’t store any illegal files, you need not fear anything.

Keep in mind that while obtaining those leaked documents was illegal, distributing them has not yet been ruled illegal. It might happen in the near future on the grounds that it endangers individuals and/or governments, or it might end up under the protection of the First Amendment, and there seem to be fairly intelligent and reasonable people arguing for both sides.

You just moved your data/computations to Amazon to eliminate any data loss or denial of service risks. But now, there’s the risk of Amazon shutting down your account — what are you doing to make sure that isn’t going to happen? How do you intend to get back up once it happens? Is it really worth it?

The Two Hour Miracle

Yesterday evening, I worked on the professional website of Alix Marcorelles. My mission was to create an online version of her résumé, with a professional feel and the appropriate SEO voodoo, based on a high-resolution picture of her and the PDF version of her résumé.

Fast Quiz: how long do you think it would take your usual IT contractors to achieve the exact same result, if you gave them a picture of that page and the money to buy the domain name and the hosting? I suspect your answer will be anywhere between one day and one week, depending on how competent they are and how much overhead there is.

Two Hours

The page you see there is the result of only two hours of work, including the boring parts about buying the domain and uploading the files to the host. This is the Two Hour Miracle: someone creates in a very short time something that would take most of your usual contractors at least one day to get right, even if you specified every single detail.

I do not wish to imply here that I am the fastest web designer in the world. In fact, I suspect that I am below average. What I say is that I am experienced enough with that craft to be aware of my own strengths and weaknesses, and since I only had two hours available to set up that website, I went for a design that was the most adequate for my skill set.

For instance, the regular three-column layout I used is based on the Blueprint CSS framework. It’s a basic out-of-the-box structure that takes me about five minutes to set up appropriately (and that’s including the time to download and install the framework). Without the framework, the same 4-4-3 column layout would take up to an hour to design, implement and test, and would probably involve some math to get right.

The Icons

The icons I used were from the FamFamFam Silk icon set, because that’s an icon set I am deeply familiar with. I immediately know whether there’s a «cell phone» icon in the pack, or what icon I could use to represent «geographical location» or «Microsoft Office» because I know them all. And I’m familiar with the «list of elements with an icon to the left» pattern, so that designing the right-hand side took me ten minutes (five of which were spent looking for a LinkedIn icon). Without intimate knowledge of an icon pack, the hunt for those icons could have taken hours.

Dispelling The Miracle

I often get to work with people who have seen the Two Hour Miracle happen for someone else, or even on another of their own projects, and then ask me to do the exact same thing for them, with only a few minor modifications.

Their modifications usually involve changing the page layout so that it does not fit within the out-of-the-box Blueprint model, which turns a five-minute hack into a two-hour design battle against the dark forces of Internet Explorer. They also involve some icons that don’t exist in FamFamFam at all, such as 32×32 icons, so I have to hunt these down for a few hours as well. And they want to see what it looks like before it goes online, so I have to delay the upload until I get a green light from them. Last but not least, instead of a one-sentence mission statement, they have two pages of «minor details» that I need to read before I can start.

And now, I have become the average IT contractor who needs two days to achieve something that, to the naked eye, looks strikingly similar to the Two Hour Miracle.

What allowed me to accomplish a Two Hour Miracle was the unfettered freedom to cut my own path towards the objective, to use the tools that let me achieve a reasonable level of quality at surprising speeds. The lack of detailed specifications and micro-management helped me get good results faster.

Assembling The Miracle Workers

How do you turn your team or contractor into miracle workers? Steer away from the classic approach to technical design: I decide on the features and ask the engineers how long it takes and how much it costs. Instead, set some high-level objectives, set a limit on the time and money available, and ask the engineers what they can do. This change of perspective leaves a lot of design work to the engineers, which means you need to select people who can work this way, and treat them in a way that helps them instead of acting like an obstacle.

  1. Find people who share your sense of quality. If you want miracles to happen, you cannot afford to be micromanaging your own sense of quality down their throats. You need people you can trust with designing on their own a solution that matches your unspoken requirements. This means they can be trusted (their responsibility: be up to the task) and that you trust them (your responsibility: let go of the details, judge them on the overall quality of their work).
  2. Find people who can think ahead. Two hours of work, no matter how miraculous, are worth nothing if they need to be thrown away to take into account a critical requirement. You need people who can plan for most contingencies through experience or natural paranoia, and see how the greater picture of your project fits in with their current objectives. This means they can plan their work (their responsibility: foresee any problems and create a solid implementation plan) and that you let them plan (your responsibility: let them know ahead of time of any critical requirements you have).
  3. Explain why you need something, not how it should work. You might be convinced that your software needs a «Really Delete?» question, but your objective is to prevent data loss through accidental deletion, and your team knows that the application model or web framework supports the superior «Deleted. Cancel?» alternate solution to that problem at a lower cost. State your objective, and trust them with finding an appropriate solution.
  4. Provide clear, immediate and non-aggressive feedback. When people get unspecified things wrong, do not take it as a sign of their incompetence or malice. They can’t read your mind and probably have their own idea of what quality looks like. Feedback is necessary to help them adjust to your ideas. Also, remain available at all times for questions—if someone needs an answer from you to get to work on their two hour miracle, don’t delay them by six hours.
  5. Allow time for experimentation. Working at a very high speed relies on only using what you are already very familiar with, to avoid bad surprises. Not only can this get boring after a while (trust me, you don’t want your team to be bored), but it also means your team does not get to improve their skills with different technologies or implementation strategies. Accept that some time will be wasted chasing technical red herrings, and let your team regularly create prototypes and proofs of concept.

While this may sound like it only applies to software design, it doesn’t. Every creative, skill-based job out there can have Two Hour Miracles (and the above list is pretty much agnostic in this regard). For instance, here’s a quote from a short piece by Jason Cohen about finding a graphic designer:

The most important qualification is whether you like their prior work. I cannot stress this enough: Designers don’t morph their style to match yours; they don’t deviate from their own style.

If they make slick, glossy, mocha-latte-modern-glassy stuff, you’d better like that. If they make crunchy, green, friendly, round-rectangle stuff, you’d better like that.

This is basically the same thing: if you ask them for something that’s not part of their Two Hour Miracle skill set, don’t expect anything done in a short time.

See what I mean?

To the people who have to deal with IT out there : if you agree with the above, would you consider sharing it around on Facebook or Twitter? I’d rather have this idea spread as much as possible :)

Facebook Pages vs Web Pages

If you’re doing anything that involves dealing with many people, you need to have a web presence. It doesn’t have to be a billion-dollar corporation or a trans-national association. My wedding will have a web presence because it involves several people and losing an online web site in your history or bookmarks is harder than losing a fancy piece of paper, and because a web page can provide so many more features than dead tree paste. For instance:

Where will the wedding be? → link to Google Maps (though Alix prefers Mappy)

When will it be? → click a link to add it to your Outlook / Google Calendar

How do I get there? → see a list of hotels and train schedules

Who is coming? → use the RSVP feature

This is turning into a wedding organization checklist, which isn’t the point. The real question is, should I create a Facebook Page or a normal Web Page?

Advantages of Facebook Pages

  1. It’s easy: you don’t need any technical abilities to set up and maintain a Facebook page.
  2. It’s free (as long as you don’t buy ads).
  3. You get a clean and readable page layout, a discussion forum, a photo gallery, a simple web analytics suite, and a readily available Open Graph node (something people can Like).
  4. The wall of your page acts as a multimedia mini-blog with automatic subscription for Facebook users (when they Like your page, all your updates show up in their feed) and RSS subscription as well.
  5. People trust Facebook pages, because Facebook would not allow harmful or offensive pages.

Advantages of Web Pages

  1. You can use any web domain. Not having your own domain name can sound unprofessional, and it can reduce your Google Ranking.
  2. You can create a web page for anything, without being limited by the Facebook terms of use or the possibility of Facebook simply wiping out your page from existence on a whim.
  3. You can have a real blog, with updates of a meaningful size.
  4. You control your web page, which lets you include any special features that Facebook does not allow (a store locator, files to be downloaded, dynamic data, restricted areas, multiple languages, a link to a twitter account).
  5. People explore web sites: they come in non-standard formats with non-standard information, so there’s curiosity involved.

So ultimately, it’s a matter of independence versus commodity. If you don’t need the benefits or social standing of having a standalone Web Page, go for a Facebook Page instead. Otherwise, be independent, but be prepared to pay the cost (in time and money).

In the long term, having a Facebook Page ultimately serves a different purpose from your Web Page, so you should strive to have both.

We Don’t Care About Your Prose

So there you are, Mr Blog Author. Or Ms (I’m not very good at guessing genders over the Internet). Through devious plans and clever hacks and selling your body on the e-streets you’ve achieved what seemed impossible at first: brand new pairs of eyeballs hitting pages on your web site every day. There you are, rubbing your hands and cackling like an evil maniac in front of your Google Analytics benchmark, wondering what to do next.

«What you should do,» shouts just about every blog expert, «is let people subscribe in a variety of ways: RSS, e-mail, twitter…»

This is right. But it’s too soon. What you have now is a reader who has only read one article on your blog. Before they add you to their RSS aggregator or give you permission to send them e-mail updates or commit to anything, they will want to know whether that article they just read is typical of your abilities as an author, or if you just managed to get lucky.

So, they will click on another link, desperately trying to read another article on your blog. The second article anyone reads on your blog is the most important article they will ever read.

Silly people all around the world think it’s the first article that matters. Bovine feces, I say. You have absolutely no control over what the first article will be—this is up to the people who link to your web site. So, if a popular twitter user mentions your article about a shrimp on a treadmill to the tune of Benny Hill, this means a crowd will be reading that article as a first article. Besides writing great articles all the time, the only thing you can do is find out what articles people are being linked to, and improve the format of those articles (do not change their text: it’s dishonest and you will be called on it).


Courtesy of IttyBiz.

You do have control over what the second article is. What people can do when they’re done reading an article, ranked from potentially bestest to potentially worstest:

  • Pick a link in the «related posts» list (you have one, right?)
  • Follow a link in the «recommended reading» list (you have one, right?)
  • Click on a comment in the «recent comments» list (you have one, right?)
  • Use the «next» and «previous» links
  • Follow a link in the «recent posts» list
  • Click on the «home» link to navigate to the latest blog post
  • Go for the archives

Which one they will pick depends on whether you have these links and where they are in the layout. It’s in your best interest to have all of the links at the top of the list, and point them to the best articles you can find on your blog (I recently did this, using the number of Facebook Likes to pick them). And the real trick is this: people don’t care about your prose, what they love or hate is your ideas and your content. Unless you’re writing about prose, of course.

If people are looking for a second article to read, it means they enjoyed the ideas and content they found in the first article they read, and they need to read more.

Your «related posts» should point to similar content. Your «recommended reading» should match the theme of your blog (you have one, right?).

Always repeat yourself on your blog, in as many posts as you can. You write the damn thing, of course it feels repetitive to you. But someone who just discovered it and is intrigued by your ideas? They just cannot. Have. Enough. They want more and you should give them more!
