Monthly Archive for January, 2011

97% of French Population was Polled

… at least once in 2010 according to the MFPI [1], finally answering accusations of bias due to the choice of the population sample. The MFPI contacted a representative sample of 988 people between March and April 2010 and asked them if they had been polled at least once during the year. 97% answered that they had been, while 3% declined to answer.

Silly, right?

Sample-based polling involves asking a small but diverse sample of people a series of questions, and then using national census data (and similar nation-wide sources) to extrapolate those results to the entire population. So, if a report somewhere says that out of 65 million people, 50,000 people are Roman Catholic Caucasian males with a higher education between the ages of 25 and 35 and living in the Paris area, and you have in your sample one Roman Catholic Caucasian male with a higher education between the ages of 25 and 35 living in the Paris area, then you will extrapolate the answer of that person to the 50,000 others who are construed to be identical.
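The extrapolation step can be sketched in a few lines of Python. This is only an illustration: the cell names, head counts and answers below are made up, and real institutes use more elaborate weighting than "each respondent speaks for an equal share of their cell".

```python
# Hypothetical census cells: cell name -> number of people in the population.
census = {
    "paris_male_25_35": 50_000,
    "paris_female_25_35": 55_000,
    "lyon_male_25_35": 20_000,
}

# Hypothetical sample: one (cell, answered_yes) pair per respondent.
sample = [
    ("paris_male_25_35", True),
    ("paris_female_25_35", False),
    ("paris_female_25_35", True),
    ("lyon_male_25_35", False),
]

def extrapolate(census, sample):
    """Estimate how many people in the whole population would answer 'yes'."""
    # Count respondents and 'yes' answers per cell.
    counts = {}
    for cell, answered_yes in sample:
        total, yes = counts.get(cell, (0, 0))
        counts[cell] = (total + 1, yes + (1 if answered_yes else 0))
    # Each respondent stands in for an equal share of their census cell.
    estimate = 0.0
    for cell, (total, yes) in counts.items():
        estimate += census[cell] * yes / total
    return estimate

yes_estimate = extrapolate(census, sample)
```

Note how the single respondent in the first cell decides the answer of all 50,000 people in it.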

The nice thing about sample-based polling, when done correctly, is that on average, it works: run the same poll one hundred million times with a different sample every single time, and the average of all those results will probably land very close to the correct one.
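A small simulation makes the "on average" part visible. It is a sketch under simple assumptions: every person is equally likely to be sampled, the true share of "yes" answers (30%) is invented, and 200 polls stand in for the hundred million.

```python
import random

def run_poll(true_share, sample_size, rng):
    """Poll `sample_size` random people; return the share answering 'yes'."""
    yes = sum(1 for _ in range(sample_size) if rng.random() < true_share)
    return yes / sample_size

rng = random.Random(2011)   # fixed seed so the run is repeatable
TRUE_SHARE = 0.30           # hypothetical: 30% of the population would say 'yes'

# Run the same poll 200 times, each time with a fresh sample of 1,000 people.
results = [run_poll(TRUE_SHARE, 1000, rng) for _ in range(200)]
average = sum(results) / len(results)
spread = max(results) - min(results)   # how far apart individual polls can land
```

Individual polls disagree with each other (the spread is not zero), while their average sits close to the true share.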

However, it’s easy to get things wrong, because sometimes the opinion of 50,000 people hinges on whether the one person from that group is a lunatic. Just a few examples:

  • Your questions might introduce unnecessary variance. I used to be a frequent user of the underground lines 13 and 14. When I was polled about this, they asked what lines were involved in my current trip — which happened to be a short detour to a restaurant to meet a friend there — and ended up attributing my extrapolated package of 5,000 users to line 7 instead.
  • People might not answer honestly. If you’re perceived as a tyrant, ask any independent statistics institute from the western world to poll your population — how do you think the terrified population of urban centers will react to a phone call asking whether they love their president for life? 90% approval rating, yay!

To end on a positive note, according to another MFPI poll, 100% of my girlfriends agree that I had a great impact on their lives.

[1] My Fictional Polling Institute.

Don’t Push – A Small Review of Cache Strategies

The standard behavior of most cache systems follows these steps:

  • Attempt to read needed data from the cache
  • If data is missing, compute it and place it in the cache
  • Return the data
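The three steps above can be sketched as follows. The cache here is a plain dictionary and `count_files` is a made-up stand-in for an expensive computation; a memcached client or a database column would play the same role.

```python
cache = {}  # hypothetical local cache

FOLDERS = {"reports": ["a.pdf", "b.pdf", "c.pdf"]}  # made-up data

def count_files(folder):
    """Stand-in for an expensive computation (hypothetical)."""
    count_files.calls += 1
    return len(FOLDERS[folder])

count_files.calls = 0  # counts how often the expensive path actually runs

def cached_count_files(folder):
    # Step 1: attempt to read the needed data from the cache.
    if folder in cache:
        return cache[folder]
    # Step 2: the data is missing, so compute it and place it in the cache.
    value = count_files(folder)
    cache[folder] = value
    # Step 3: return the data.
    return value
```

Calling `cached_count_files("reports")` twice runs `count_files` only once.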

This is a fairly streamlined process that’s easy to add to almost any single algorithm that constructs data. The cache could be local (it’s part of the application, or even of the current function call), it could be dedicated (memcached), or the data might be persisted back to the database (such as adding the number of files in a given folder to the folder object itself instead of counting them every time).

The root of this strategy is the principle of memoization: if a function is pure (that is, calling it with the same arguments twice will return the same result twice), then you can place such a cache in front of that function so that it will only be called once for every argument.
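In Python, for instance, the standard library provides this as a decorator. The `square` function and the `calls` list are only there for illustration; recording calls is itself a side effect, used here to show when the real function runs.

```python
from functools import lru_cache

calls = []  # records real invocations, for demonstration only

@lru_cache(maxsize=None)  # unbounded memoization cache in front of the function
def square(n):
    calls.append(n)
    return n * n

square(4)
square(4)  # served from the cache, the function body does not run again
square(5)
```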

Memoization obviously found its way into RunOrg, because it’s literally a one-word optimization hint that trades memory for performance where it matters. In practice, in a web application like RunOrg, the only really costly computation is sending requests to the database, which is by definition not pure. Still, I can usually expect that for the duration of a single HTTP request, the database contents will remain reasonably stable, so I can create a temporary memoization cache when the request starts and drop it when the response is sent. Actually, I’m using a slight variation on standard memoization which is batch memoization: in order to improve performance, queries for objects A, B, C and D are represented as a single batch query to the database asking for the list of four objects. With standard memoization, if I asked for objects A, B and C, then for objects B, C and D, then six objects in total would be requested (because those two lists are different). By extending the memoization algorithm to know that list elements are independent, I can have the second query ask for only D, and retrieve values for B and C based on the first request.
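A minimal sketch of batch memoization, assuming a hypothetical `fetch_many` function that takes a list of ids and returns a dictionary of objects. This is not RunOrg's actual code, only an illustration of the idea that list elements are cached independently.

```python
class BatchMemo:
    """Per-request memoization that treats list elements independently."""

    def __init__(self, fetch_many):
        self._fetch_many = fetch_many  # hypothetical: list of ids -> {id: object}
        self._known = {}               # meant to be dropped when the request ends

    def get_many(self, ids):
        # Only ask the database for ids not seen earlier in this request.
        missing = [i for i in dict.fromkeys(ids) if i not in self._known]
        if missing:
            self._known.update(self._fetch_many(missing))
        return [self._known[i] for i in ids]

queries = []  # records each batch sent to the fake database

def fake_database(ids):
    queries.append(list(ids))
    return {i: f"object {i}" for i in ids}

memo = BatchMemo(fake_database)
first = memo.get_many(["A", "B", "C"])
second = memo.get_many(["B", "C", "D"])  # only D reaches the database
```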

Outside of this situation, however, functions can hardly be considered pure, so special steps must be taken to keep the cache up to date. This results in three common strategies.

The first one, which is the cache expiration strategy, is to give up on data freshness and decide that data in the cache will survive for a fixed duration regardless of whether the actual data has changed or not — so, instead of declaring that data is always up to date, it declares that any change will be visible everywhere in less than X seconds. While somewhat weak, this strategy is particularly effective because it does not need any kind of knowledge of the relationship between the cache and the underlying persistent data — the only connection between the two is the compute-if-missing steps outlined above.
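Here is a small sketch of the expiration strategy. The class name and the injectable `clock` parameter are assumptions made so the example can fake the passage of time; the cache knows nothing about the data behind `compute`.

```python
import time

class ExpiringCache:
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock   # injectable so the example can fake time
        self._entries = {}    # key -> (expires_at, value)

    def get(self, key, compute):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[0]:
            return entry[1]   # still within its lifetime, possibly stale
        value = compute()     # missing or expired: compute-if-missing
        self._entries[key] = (now + self._ttl, value)
        return value

fake_now = [0.0]
cache = ExpiringCache(ttl_seconds=30, clock=lambda: fake_now[0])
source = {"title": "old"}

a = cache.get("title", lambda: source["title"])
source["title"] = "new"                          # the underlying data changes
b = cache.get("title", lambda: source["title"])  # the cache does not notice yet
fake_now[0] = 31.0                               # more than X = 30 seconds later
c = cache.get("title", lambda: source["title"])  # now the change is visible
```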

Once you decide to handle the relationship seriously, two more strategies become available. The cache invalidation strategy is activated when the underlying data is changed, and invalidates all cache items that are dependent on that data. Thus, subsequent requests for those items will trigger a cache re-computation and always serve fresh data. Of course, this assumes that the cache system can easily tell which items should be invalidated. This is fairly easy in a one-to-one data-to-cache mapping, but as pieces of data can be mixed and matched into various cache items, this requires an unusually complex architecture to handle.
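One way to track which items depend on which data is a reverse index from data ids to cache keys. The sketch below assumes the caller declares the dependencies of each item; the ids (`profile:1`, `poll:7`) and the `badge` function are invented for the example.

```python
from collections import defaultdict

class InvalidatingCache:
    def __init__(self):
        self._values = {}
        self._dependents = defaultdict(set)  # data id -> cache keys built from it

    def get(self, key, depends_on, compute):
        if key not in self._values:
            self._values[key] = compute()
            for data_id in depends_on:
                self._dependents[data_id].add(key)
        return self._values[key]

    def data_changed(self, data_id):
        # Drop every cache item that was computed from this piece of data.
        for key in self._dependents.pop(data_id, set()):
            self._values.pop(key, None)

db = {"profile:1": "Alice", "poll:7": "M"}
cache = InvalidatingCache()

def badge():
    return f"{db['profile:1']} ({db['poll:7']})"

deps = ["profile:1", "poll:7"]
before = cache.get("badge:1", deps, badge)
db["poll:7"] = "L"
cache.data_changed("poll:7")   # the write path notifies the cache
after = cache.get("badge:1", deps, badge)
```

With a single item depending on two pieces of data the bookkeeping is already visible; it grows quickly as items mix more sources.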

A nice design trick to keep in mind is that you don’t always need to find all that data: sometimes, you can simply «lose» it. For instance, web caching uses a combination of expiration and invalidation strategies. When a file is sent to a browser through HTTP, it sometimes carries a header explaining when it expires. This is useful when all the pages on your web site use the same CSS or JavaScript files, because then your visitors will only need to download them once and will use the cached versions from then on. To handle changes in CSS and JavaScript files, some web sites rely on cache expiration (a few minutes or hours of cache lifetime, so that any changes are detected soon enough) while others use cache invalidation. Obviously, the web site can’t go and notify every single browser that the CSS files have changed (especially since some browsers are closed or offline) so it will simply lose the data: while the CSS file named style.css?0001 will remain in cache for up to a month, the pages on the site are now asking for style.css?0002.
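A sketch of how such "losable" URLs might be generated. Note the substitution: instead of the hand-maintained counter (0001, 0002) from the example above, this version derives the suffix from a hash of the file contents, which is an assumption of the sketch and only one of several ways to do it.

```python
import hashlib

def versioned_url(path, content):
    """Return `path` with a query string that changes whenever `content` does."""
    digest = hashlib.sha256(content).hexdigest()[:8]
    return f"{path}?{digest}"

old_url = versioned_url("style.css", b"body { color: black; }")
same_url = versioned_url("style.css", b"body { color: black; }")
new_url = versioned_url("style.css", b"body { color: navy; }")
```

As long as the file is unchanged, pages keep pointing at the same URL and browsers keep their cached copy; once it changes, pages ask for a URL no browser has seen before.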

The third, the cache refresh strategy, is a variant on the former: instead of merely invalidating the cached data, this strategy computes the new data and places it directly into the cache. This is necessary when the data is frequently accessed and the computation is long: if one hundred users come asking for the data while it’s invalidated, then all of them will compute the data as part of the “if missing, compute” step of the caching process, which will probably bring the server to its knees — what people call a stampede — so the only safe thing to do here is to keep data in the cache at all times and replace it with a more recent value whenever necessary.

Another nice design trick is to use a flexible expiration date to turn a cache invalidation strategy into a cache refresh strategy: instead of invalidating the item by removing it from the cache, merely set its expiration date somewhere in the past. Then, to avoid stampedes, whenever a user detects that the cache is expired, they set the expiration date to sometime in the future and start computing the data. So, the first reader will notice a delay (as their request will reconstruct the cache), while subsequent readers will instantly read the old cached version without triggering a stampede, until the first request ends and the cache contains the new version of the item. To choose between the two: if the event that triggers the refresh provides you with data that improves the time required to update the cache, then update it, otherwise merely invalidate it and rely on your stampede protection to do the job.
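The flexible expiration trick can be sketched as follows. This is a single-process illustration with invented names (`SoftExpiryCache`, `grace`); in a real multi-process setup the "check expiration, then push it forward" step would have to be atomic, and this sketch does nothing special for an item that was never cached at all.

```python
import time

class SoftExpiryCache:
    def __init__(self, ttl, grace, clock=time.monotonic):
        self._ttl = ttl        # normal lifetime of a fresh value
        self._grace = grace    # how long others read the old value during a refresh
        self._clock = clock
        self._entries = {}     # key -> [expires_at, value]

    def invalidate(self, key):
        entry = self._entries.get(key)
        if entry is not None:
            entry[0] = float("-inf")  # expired, but the old value is kept

    def get(self, key, compute):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is None:
            value = compute()
            self._entries[key] = [now + self._ttl, value]
            return value
        if now < entry[0]:
            return entry[1]
        # Expired: claim the refresh by pushing the expiration into the future,
        # so that readers arriving meanwhile get the old value instead of computing.
        entry[0] = now + self._grace
        value = compute()
        self._entries[key] = [self._clock() + self._ttl, value]
        return value

now = [0.0]
cache = SoftExpiryCache(ttl=60, grace=5, clock=lambda: now[0])
cache.get("k", lambda: "v1")
cache.invalidate("k")

seen_by_others = []

def slow_compute():
    # Simulate another reader arriving while the first one is still computing.
    seen_by_others.append(cache.get("k", lambda: "should not run"))
    return "v2"

first_reader = cache.get("k", slow_compute)
later_reader = cache.get("k", lambda: "v3")
```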

There’s one last situation that makes caching complicated, which I’ve recently had to handle with the RunOrg application: indexing. Suppose you have a huge amount of data that you need to wade through, nicely split into separate objects that each have their own responsibility (my profile, my membership information, my participation in event X, my answers to poll Y…) but you sometimes need to virtually aggregate all that data and traverse it to retrieve only certain parts of it: give me the name, premium-or-not-premium membership status and answer to “T-Shirt Size” in poll Y, sorted by the registration date for event X. Yes, that’s one of the many queries that RunOrg lets you do (and you can even print out that data to serve as a list of participants). Now, trust me, there’s no sane way to dynamically run queries of the sort on a clean normalized database and still get reasonable performance. So, you need to create an alternate, denormalized representation of all that data and keep it in cache to avoid re-computing it for every request.

The problem with such a cache is that you cannot afford to re-construct a thousand lines of cached data because one user changed their T-Shirt Size answer, so there can be no high-level validity check. Basically, if you can’t trust the cache to be up to date when you run your read query, you lose. The traditional “try to read, if invalid update” approach to caching goes straight out the window. You need a solid cache refresh implementation that pushes the most recent data into the cache as soon as it becomes available.

RunOrg uses a variant — the cache pull strategy. This is merely a small semantic shift, but it’s quite helpful: in the original cache refresh situation, the data model needs to be aware of the cache, because it must actively send data to the cache whenever a change happens. With the RunOrg variant, the data model merely publishes a “data was updated” event that the cache module may listen to and react by refreshing its contents. So, the knowledge of how to extract data from the model and place it in the cache now belongs to the cache module instead of being spread over both the data model and the cache module. This not only makes the code cleaner — the data model becomes cache-independent and thus easier to read through, with cache modules being tacked onto it using the event system — but it also lets the cache react to events from different data models: an item might be updated when the profile changes, or the membership information changes, or the event X participation changes, or the answer to poll Y changes… and will have to read data from all four to compute the new value anyway.
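The shape of this arrangement can be sketched in Python. Everything here is hypothetical (class names, event names, the two tiny data models) and is not RunOrg's code; in particular, the events are dispatched synchronously to keep the example short, whereas a real system would process them asynchronously.

```python
from collections import defaultdict

class EventBus:
    def __init__(self):
        self._listeners = defaultdict(list)

    def subscribe(self, event, listener):
        self._listeners[event].append(listener)

    def publish(self, event, payload):
        for listener in self._listeners[event]:
            listener(payload)

class Profiles:
    """Data model: publishes events, knows nothing about any cache."""
    def __init__(self, bus):
        self._bus, self._names = bus, {}

    def set_name(self, user_id, name):
        self._names[user_id] = name
        self._bus.publish("profile.updated", user_id)

    def name(self, user_id):
        return self._names.get(user_id)

class PollAnswers:
    """Second, independent data model (answers to one poll)."""
    def __init__(self, bus):
        self._bus, self._answers = bus, {}

    def set_answer(self, user_id, answer):
        self._answers[user_id] = answer
        self._bus.publish("poll.updated", user_id)

    def answer(self, user_id):
        return self._answers.get(user_id)

class ParticipantIndex:
    """Cache module: listens to both models and pulls what it needs."""
    def __init__(self, bus, profiles, polls):
        self._profiles, self._polls = profiles, polls
        self.rows = {}  # user id -> denormalized (name, t-shirt size) row
        bus.subscribe("profile.updated", self._refresh)
        bus.subscribe("poll.updated", self._refresh)

    def _refresh(self, user_id):
        self.rows[user_id] = (self._profiles.name(user_id),
                              self._polls.answer(user_id))

bus = EventBus()
profiles, polls = Profiles(bus), PollAnswers(bus)
index = ParticipantIndex(bus, profiles, polls)

profiles.set_name(1, "Alice")
polls.set_answer(1, "M")
polls.set_answer(1, "L")
```

Neither data model mentions the index; all knowledge of how a row is built lives in `ParticipantIndex._refresh`.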

Obviously, the cache pull strategy is a more complex architecture than the previous one:

  • You need an event system: the entire contraption hinges on the fact that a cache module can listen to the changes that happen in a data module.
  • Your cache module must track the dependencies of each item, in order to update that item when it receives a change event for one of its dependencies.
  • You need asynchronous processing, as pulling values for dozens of items simply cannot be done as part of the standard HTTP response cycle.
  • You need to follow clean multi-process patterns to handle simultaneous updates of some items.

Still, given the performance we achieve with this approach, and the clean code that results from the underlying events-and-async structure, the results are certainly worth the effort.

RunOrg is my Start-Up; we provide an online tool that helps associations, unions, organizations and communities manage their members, contacts, activities, events, knowledge and online presence.

Work ≠ Progress

I did a lot of work today. Mostly, I tracked down and eliminated a nasty little problem related to our @runorg.com email addresses and our DNS records.

DNS is the directory system which determines which particular computer handles the requests to a given domain name. So, if you’re looking for holy-grail.runorg.com, a DNS entry mentions that it points to the machine known on the internet as 188.165.231.88, which happens to be our main production server.

The MX records are used when you’re looking for the mailboxes for that domain. This is because usually, you don’t want your web server to handle your e-mail: it’s handled elsewhere, such as another company server, or maybe Gmail. So, you can specify a main DNS entry for your domain and then use the MX record to point to another server specifically for e-mail.

Finally, the CNAME records represent the canonical name. We don’t want our main web site to be available both on http://runorg.com and http://www.runorg.com, because it’s confusing and bad for the search engine ranking. So, I created a CNAME record stating that runorg.com should point at www.runorg.com.

What I did not take into account (or even know) was that CNAME records take priority over MX records. So, when someone sent an e-mail to foobar@runorg.com, it would undergo canonicalization and point at foobar@www.runorg.com instead. Since there was no MX record for the latter, the e-mail would then disappear into the void. Our tools and newsletters apparently ignored the CNAME when sending e-mail, so we received those correctly.
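The failure can be reproduced with a toy resolver. This is a deliberately simplified model, not a real DNS client: the record table, the mail host name and the IP address (taken from a documentation-only range) are all made up, and the only behaviour modelled is that an alias is followed before any other record type is looked up.

```python
# Hypothetical zone contents: (name, record type) -> value.
records = {
    ("runorg.com", "CNAME"): "www.runorg.com",
    ("runorg.com", "MX"): "mail.example.net",   # made-up mail host
    ("www.runorg.com", "A"): "192.0.2.10",      # made-up address
}

def lookup(name, rtype, max_hops=8):
    """Toy lookup: follow CNAME aliases first, then read the requested record."""
    for _ in range(max_hops):
        alias = records.get((name, "CNAME"))
        if alias is None or rtype == "CNAME":
            return records.get((name, rtype))
        name = alias  # the query continues at the canonical name
    return None       # give up on alias loops

web = lookup("runorg.com", "A")    # follows the alias, finds the web server
mail = lookup("runorg.com", "MX")  # follows the alias too, finds nothing
```

The MX record sitting next to the CNAME is never consulted, which is where the e-mail goes missing.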

So, my entire day was spent hunting down an obscure, unpredictable and not-quite-documented error in my DNS records. It was necessary work and it certainly kept me busy, but it wasn’t progress.

Our team has a looming deadline: the delivery of our first version of the software. It’s when we move from an “implement all the stuff we need before we can deliver” strategy to an “improve or add features to the existing product” strategy (which is an entirely different mechanism). Progress is what brings us closer to that transition: while dealing with the DNS issue was necessary, it did not move me an inch closer to delivering version 1.0.

What is the single largest difference between working as an employee for another firm and working on your own Start-Up? Before I started, I would have guessed it would be the work hours (I now work week-ends quite often), the commute (I work at home because we’re too small to need offices), the freedom (I’m literally my own boss) or the lack of money (no comment). Now it’s pretty obvious that the single greatest difference is that I now emphasize progress more than I emphasize work.

In my previous jobs, there was a fixed set of objectives which had to be accomplished, so I would just come to work every day and chip away at the monolith of work to be done, and since it all had to be done anyway, I could do it in any order I wished. Since I’ve started working on my Start-Up, I find myself increasingly questioning the very objectives I’m trying to accomplish: is this going to let me ship sooner, or not? The freedom of choosing (and discarding) my objectives myself comes with the responsibility of making the right choices.

That’s a question I never asked myself before.

When you think about it, there are many things that are work but not progress. Some are done because it feels easier to do them sooner rather than later. Others are done because, let’s face it, sometimes you have low morale and a neat exciting feature comes up that you’d rather implement even though it’s purely gratuitous (I added a CSV export feature recently that is not necessary in any way, and I know my definition of exciting is weird but bear with me). Others stem from the necessary shame of delivering a half-baked product, but bear in mind that:

If you are not embarrassed by the first version of your product, you’ve launched too late.
- Reid Hoffman, LinkedIn founder

Delivering a huge product with a small under-funded team is ultimately a find-the-shortest-path endeavour. Choose your next objective based on that.

Google Chrome to end H.264 Support

This was announced yesterday on the Chromium blog:

[...] we are changing Chrome’s HTML5 <video> support to make it consistent with the codecs already supported by the open Chromium project. Specifically, we are supporting the WebM (VP8) and Theora video codecs, and will consider adding support for other high-quality open codecs in the future. Though H.264 plays an important role in video, as our goal is to enable open innovation, support for the codec will be removed and our resources directed towards completely open codec technologies.

And of course, it was met with a decent and quite enjoyable wad of trolls and flames in the comments section.

H.264 is a video compression method (the complete name being H.264/MPEG-4 AVC) which happens to be the de facto standard for the up-and-coming HTML5 revolution, the next step in web technology spearheaded by Chrome, Firefox 4, Safari and maybe Internet Explorer 9, which lets the user view videos without requiring a Flash Player. The H.264 standard is supported by most modern browsers (with the notable exception of Firefox), by Adobe Flash, and by the hardware of several mobile devices including the iPhone and iPad. Yep: there are special chips available for decoding H.264 while using less battery power.

In short, if you encode video as H.264, chances are that anyone on the planet will be able to view it.

How widespread is it? If you’ve watched YouTube recently, you’ve seen an H.264-encoded video.

By contrast, the alternative proposed above (WebM) does not have the same range of support: Firefox and Chrome do support it, Adobe Flash support should happen “any day now”, and no significant hardware implementations exist yet. This means that a WebM video can be viewed in HTML5 on Firefox and Chrome, through a Flash-based player in the other browsers, through a battery-guzzling software codec on Android phones, and through the power of your imagination on iOS devices.

So, WebM is the codec with inferior support. Why is Chrome dropping H.264 in favor of it?

This is actually a gambit to force content producers away from H.264, because Google is uncomfortable with the fact that H.264 is patented.

Being patented means that if you write software or hardware that encodes H.264 video (such as a camera), you need to pay royalties to an organization known as MPEG-LA. The same happens if you create software or hardware that decodes H.264 video (such as including a codec in the HTML5 implementation of your browser). And even if you only distribute content (using tools provided by others), you still need to pay royalties — the one exception here is that if you distribute content for free, you will never have to pay royalties.

The amount paid is not really an issue: it’s about 2% of the price of any content you distribute that’s over 12 minutes in length, $0.15 per subscriber if you have more than 100,000 … basically, by the time it starts to hurt, you’re already bringing in a lot of cash to cover your losses. So, while a free format like WebM would let you save those royalties, it’s probably not worth losing iOS customers. The only exception here is when Firefox needs to implement the H.264 codec for its own HTML5 support, which falls in the “decoding video” section above: this would end up costing the foundation a whopping $5 million for its 270+ million users worldwide. This explains why there is no such support.

Could the rates increase over time? Read for yourself:

Interestingly, MPEG LA calls out that fees cannot increase by more than 10% per year, but the bump from 2008-2009 to 2010 is almost a 20% annual increase.

Another very real issue is that to do anything with H.264, you need a license. If you have a single video on your small web site and it’s encoded as H.264, you need to contact the MPEG-LA and ask for a license. You will pay zero royalties, but you still need the license. If you’re a corporation, this means you need your lawyer team to study the topic to determine whether said license creates any liabilities for you, which isn’t free. That, or you accept the risk of doing things without a license. Your funeral.

By comparison, WebM is a free, open standard. No paperwork, no royalties, no licensing required.

The largest pain for Google remains Apple — a move away from H.264 is possible for anyone who does not need to support iOS. This will likely end as a battle of titans between the market share of Chrome and that of iOS devices, until one of them caves in and implements the other’s format.

As for HTML5, the standard is still being built, but there are three evils one can pick from and the lesser is not easily found:

  • Making H.264 support mandatory would be an honest acknowledgement of the format’s current omnipresence, but it would spell doom for any platform that cannot pay for the decoder license — open source browsers would only be able to display H.264 through HTML5 if someone decided to pay for the license (either in a fashion similar to how Adobe pays for including H.264 in Flash, or by offering a special plugin for a fee) and the hairy tangle of open source implementation of patented algorithms is sorted out.
  • Requiring support of either H.264 or WebM (or both: every browser decides) seems like a good compromise, but the cost of hosting and serving both H.264 and WebM video is steep, so every content provider will probably end up providing only one of the two and rely on Flash to display the videos on non-supporting browsers. Seems like a standardization failure to me.
  • Making WebM support mandatory is an interesting solution, but it’s a waste of a perfectly good standard (H.264 is a good standard).

I wonder if the waste and confusion caused by the patents on H.264 are worth it, especially since for many of these patents the royalties are a reward for being the first to patent the ideas, and not for working hard to find them. In a world where every idea emerges from the brains of dozens of computer scientists and engineers, what is the rationale behind patents?

Diagon Alley – When to Jump on the Bandwagon?

In J. K. Rowling’s Harry Potter series, Diagon Alley is a fictional place in London filled to the brim with magic-related shops and institutions, hidden away from the eyes of non-magical humans. It makes sense, if you’re a wizard wishing to establish a new shop and seeking as large an audience as possible, to do so in Diagon Alley: not only would you benefit from the existing infrastructure that keeps Muggles away and allows easy access to wizards, but you would also have improved access to the customers of existing shops that happen to be in the area. More wizards walk through the alley in minutes than would walk through any other street in London over the course of an entire week.

In Paris, I enjoy the services of our very own Diagon Alley. It is actually called Rue Monsieur-Le-Prince, and it can be found stretching from the Luxembourg gardens near the Senate and north up to the Place de l’Odéon. Instead of magical shops, it is home to a massive number of Japanese sushi restaurants: Itadaki, Tokiotori, Kiotori, Yokorama, Top Sushi, Sushi Yaki and Sushi Royal among others. The number of such restaurants is surprising: this is a short, narrow one-way street, much smaller than the nearby Boulevard Saint-Michel and Boulevard Saint-Germain, which means that you literally cannot walk ten feet without seeing a sushi restaurant. The local market for sushi food is beyond saturated, and the competition between restaurants is fierce — a quick strategic analysis would determine that a new restaurant would enjoy far less competition if they were to establish themselves anywhere else.

This is not caused by a heavy immigrant population. In fact, the area is known for its high real estate rates and is out of reach of the average immigrant budget.

Nor is this a unique situation: Rue Sainte-Anne is a sister street on the other side of the Seine where most of the Donburi and Okonomiyaki traditional Japanese restaurants can be found. All together, clustered on a single street.

This makes no sense. Why is this happening?

Back in 1991, Paul Krugman penned Increasing Returns and Economic Geography: as transportation costs decrease and economies of scale increase, Krugman argues, it becomes more efficient for a given industry to concentrate in a single location, because the transportation costs involved in having to move the products are paid for by the economies of scale involved. For instance, building cell phones in a single city and selling them in an entire country is cheaper than having a cell phone factory in every city.

Sushi restaurants have minor economies of scale as far as production goes: fresh fish is a bit cheaper when the delivery does not involve a one-hour detour (especially since small daily deliveries are involved to keep the fish fresh), and it’s easier to set up a new sushi restaurant in a building that previously housed another that went bankrupt, because the fridges and tools and kitchen are already there. But does this explain everything?

Actually, there are economies of scale in marketing here: sushi being an exotic food, a local market alone is not enough to support a sushi restaurant because most people don’t eat there daily. To survive, a restaurant must have a brand strong enough to cover a wider area. Few sushi restaurants have the power to create their own brand, but a dozen of them on a single short street is enough to make Rue Monsieur-Le-Prince the de facto location where sushi and yakitori can be eaten. This, in turn, creates an incentive for more sushi restaurants to appear there: customers are loyal to the street, not the individual restaurants, and a cleaner (by virtue of being newer) restaurant might attract more customers.

Of course, there are still niche sub-markets within the street:

  • There’s at least one restaurant open for lunch or dinner, including holidays and week-ends.
  • There’s at least one restaurant open on afternoons.
  • The restaurant closest to the large Luxembourg underground station also happens to be the largest, in order to have as many customers as possible with average food quality.
  • There are two restaurants with quiet, clean private rooms for business customers.
  • There’s at least one restaurant with a nice second-floor view on the Boulevard Saint-Michel.
  • There’s at least one tightly packed Chinese-friendly restaurant.

In conclusion, while there are advantages to finding your own market and having a monopoly there, there are also advantages to sticking close to existing competitors in the form of economies of scale in production and marketing: if you use mainstream tools, you will find more support for them; if you provide a well-known kind of service, you will have an easier time convincing your customers that they need it.


