Monthly Archive for May, 2011

Dealing With Huge Projects

Right now, I’m the only developer working on RunOrg, which happens to be a 45k-line project written in OCaml. According to a common terseness observation, this is equivalent to managing a 135k-line project in Java. Alone.

OCaml shares a problem with many dynamic languages : it’s very expressive, but there is no general consensus on what architectural best practices should be, so there are literally dozens of different ways a given feature might be implemented that cannot be discriminated on anything but taste. This leads to a variety of unique design choices throughout the application which, despite working well with each other, cause programmers to «discover» new architectures every time.

In the end, I believe that the philosophy of using the best tool for every job can easily be taken to painful extremes if you are not careful. You encounter a new problem, pick an unusual but well-adapted solution, and it makes perfect sense to you, so you move on. Months later, you come back and the solution does not make sense anymore because you have forgotten a small detail about how it works or why it was done this way, and you have to hunt that small detail down by reading the code. I’ve pretty much solved this anti-pattern, so I’ll come back to it later.

The main point I’m making here is that for every project, there is an ideal mudball of code that happens to perfectly implement everything without bugs, all in a single gigantic file, and you cannot write this mudball. For a human, there’s no way to manage anything mudballish past a few hundred lines because you cannot wrap your mortal mind around the possibility that every line might interact with any other line in the project… so, as an architect, you slice up the mudball into more acceptable bits that you politely call «modules» in order to reduce the number of things any given line might interact with. We reduce the amount of data we need to cope with by adding big «you don’t need to think about this» signs everywhere (and making sure the signs don’t lie, obviously).

And on a small scale, this works, because you only have a dozen modules and it’s enough to fit in your short-term memory. RunOrg currently has 260 high-level modules, and several times that amount in sub-modules. No UML design, no matter how comprehensive, can make all those modules fit in my mind at once. I must find some «you don’t need to think about this» signs before I can move on.

There are mostly two ways of slicing a given project into modules: horizontal and vertical.

Vertical slices happen when there are dependencies, and modules look like layers stacked on top of each other with each layer being allowed to access the layers below. The RunOrg project architecture actually starts with a clean set of vertical slices : the controller layer deals with HTTP actions by using the view layer and the model layer below it, but the view layer cannot access the controller layer, and the model layer cannot access either.

Horizontal slices happen when there are absolutely no dependencies, and modules look like books cleanly arranged next to each other on a shelf. This usually happens when those modules represent the same concept for different purposes. In the RunOrg project, the controller layer is divided into many action modules, with each of these modules handling the HTTP requests for a limited part of the application. For instance, there’s a Login module in charge of handling HTTP requests related to logging in, and a File module in charge of handling HTTP requests related to uploading files. The concept is the same (handle HTTP requests) but the purpose is different (logging in, uploading files). And there is no need for either module to know about the existence of the other.

Knowing whether slices are vertical or horizontal immediately tells the programmer about what dependencies should be considered for that slice. And it is all recursive : the Login module of the controller layer is further divided into a Login_common bottom layer for common definitions, the root Login top layer for binding everything together, and an intermediary layer of horizontal Login_form, Login_signup, Login_lost slices dedicated to the various independent aspects of logging in. The naming convention helps identify the pattern used.
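
To make the dependency rules concrete, here is a minimal sketch in JavaScript (not the OCaml used by RunOrg). The module names come from the paragraph above, but the layer numbers, the dependency lists and the violations function are all assumptions invented for illustration.

```javascript
// Hypothetical description of the Login slices: a layer index (higher means
// closer to the top) and the modules each one uses. The dependency lists are
// invented for this example.
const modules = {
  Login_common: { layer: 0, uses: [] },
  Login_form:   { layer: 1, uses: ["Login_common"] },
  Login_signup: { layer: 1, uses: ["Login_common"] },
  Login_lost:   { layer: 1, uses: ["Login_common"] },
  Login:        { layer: 2, uses: ["Login_form", "Login_signup", "Login_lost"] },
};

// Returns the dependencies that break the slicing rules. A module may only
// use modules from strictly lower layers (vertical rule), which also means
// that modules in the same layer never see each other (horizontal rule).
// Dependencies on unknown modules are reported as well.
function violations(mods) {
  const found = [];
  for (const [name, { layer, uses }] of Object.entries(mods)) {
    for (const dep of uses) {
      if (!(dep in mods) || mods[dep].layer >= layer) {
        found.push(name + " -> " + dep);
      }
    }
  }
  return found;
}
```

With the table above, violations(modules) returns an empty list. If Login_form started using Login_signup, that dependency would be reported, because both sit in the same horizontal layer.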

In practice, the slices do not necessarily map to actual namespaces or modules because, especially at very low levels, the granularity involved to segregate the two would be too verbose. For instance, while it may appear that the controller layer is made up of modules that are all horizontal slices, this is not the case : while the actions (functions that respond to HTTP requests) are indeed independent horizontal slices, the layer also contains helpers (functions that provide common functionality to actions) that follow a vertical layering, and a given module will usually contain both actions and helpers indiscriminately.

What is relevant here is that the patterns used will let you determine easily what kind of slice you are dealing with. And a pattern is a named convention (action, action helper, view template, table) that is respected by relevant pieces of code, in terms of :

  • Location : where is it within the module and file hierarchy, and in relation to other constructs within the same module ?
  • Structure : what does the code look like ? What parts of the pattern are expected to be changed and what parts should always be the same ?
  • Type : what is the signature of the module, class or function defined by the code ?
  • Name : is there a common suffix or a way to give a name to entities following the pattern ?

These are guidelines : a pattern should usually have at least one of these, and the more the better, but you don’t have to implement all four if it is counter-productive to do so. Also, a pattern should define a dependency rule : it is generally understood that two pieces of code that follow the same pattern have a dependency dictated by that pattern, and that dependency is usually a horizontal slice.

The important thing about patterns is that they are not an external influence on your project. If you limit yourself only to those patterns that are dictated by the Gang of Four book, or by the framework you are using, then you will miss out on the many patterns that will emerge naturally within your application. Quite to the contrary, it is essential to identify as often as possible the patterns that appear in your code, clean them up by providing both a name and conventions of location/structure/type/name, and apply them wherever necessary. This will make your code more easily recognized by the programmer, because there are only a handful of fairly generic concepts to learn (the patterns) and everything else can be understood by finding out what patterns are used. Even better, familiarity with patterns places a «you don’t need to think about this» sign on the parts of the pattern structure that stay the same, because they never change.

And now, I have cleverly returned to my previous point : the inevitable conflict between the «use the best tool» rule and the «use the same pattern» rule.

Using the best tool creates the risk of writing code that is very difficult to understand later on, because there are too many special cases. Using the same pattern everywhere causes problems when the pattern is ill suited to the problem being solved, such that it creates code that is too long, too repetitive, or too unsafe.

In my day-to-day routine, I follow the «use the same pattern» rule until it becomes too painful. Then, I just change the pattern to make it less painful to use, and propagate the changes to all the places where it is used, which in turn was made possible by the fact that I did use the same pattern everywhere.

Obviously, I don’t have a pattern for everything. So, whenever I encounter a problem for the first time, I go with the best tool instead. Once that kind of problem is solved several times, a pattern will emerge and some refactoring will happen.

Article image © holycalamity — Flickr

Rewrite Your Code

Writing code relies on four kinds of decisions:

  • What algorithm can implement this feature?
  • How is that algorithm best written in that specific language?
  • What platform quirks and subtle edge cases must be accounted for?
  • How does this code fit in with the rest of the application?

Regardless of team experience or preliminary analysis, some of these decisions will be incorrect. Maybe the algorithm failed to take into account the unusual distribution of real-world data ; maybe there was a better way to write it ; maybe there’s a subtle bug that will not be discovered for weeks ; maybe a possible code reuse has not been identified during the design phase… or maybe the customer requirements that the feature was based on were not actually adapted to the customer needs.

Such bad decisions get in the way of users, but they also hinder developers, who have to regularly work around existing bad decisions, which in turn causes more bad decisions to be made in recurrent “lesser of two evils” situations.

It is a good idea to go back on your bad decisions and make new ones instead. They will not necessarily be good, but at least they will address some of the problems with the old ones.

Don’t try to go back on everything at once. Most of the time, the shortcomings of a decision can be identified in hindsight, but if you change too many things at once, that hindsight will be lost. In particular, throwing away non-trivial portions of code (anything beyond a single function) in order to rewrite it from scratch is quite risky, especially since it might also discard good decisions that would be hard to retrieve.

Don’t make your code difficult to change. Going back on your decisions will involve rewriting code. Lots of it. So far, most of the code in the RunOrg project has been rewritten at least three times. Make sure your language, frameworks, libraries and unit tests all work together to make it easy to evolve specific parts of your code to change decisions. The worst situation for a project to be in is code freeze — changing code is forbidden because it’s too risky and it might break something. If you suspect that your project might be heading that way, immediately drop everything you are doing and bring your project back to an acceptable state ; if you are not allowed to do so, make sure you send out a warning to anyone who might need to know.

Don’t make too many decisions. This is usually spelled out as YAGNI : You Ain’t Gonna Need It. If there is currently no need for a given feature, other than the fact that it should remain possible in the future, then don’t implement it. Implementing it will involve making many decisions about how it should happen, and lack of practical application will increase the odds that those decisions are wrong.

Don’t be afraid to go back on huge decisions. Weeks ago, an initial decision we made on the RunOrg project turned out to have huge performance implications. I was faced with two choices : keep that decision, and manually optimize the locations where the performance suffered the most (this involved manually handling caching and batches) ; or go back on that decision, re-architect the entire database access system and propagate those changes throughout literally half the project, in order to allow automatic caching and batch construction in ways that manual optimization could never allow. The rewrite took me four days, with some aftershocks being felt several days afterwards (strangely enough, changing 20k lines of code resulted in only four fairly obvious bugs).
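
For readers wondering what «automatic caching and batch construction» can look like in general, here is a minimal JavaScript sketch. It is not the RunOrg design (which is written in OCaml and not described here) : makeLoader, load, flush and fetchMany are hypothetical names, and fetchMany is assumed to take a list of ids and return a Map from id to value.

```javascript
// Hypothetical loader: collects requested ids, fetches them in one batch,
// and remembers the results. `fetchMany` stands in for a real database call
// and is assumed to return a Map of id -> value.
function makeLoader(fetchMany) {
  const cache = new Map();   // id -> value already fetched
  const pending = new Map(); // id -> callbacks waiting for that id

  return {
    // Ask for a value: the callback runs immediately on a cache hit,
    // otherwise it waits for the next flush.
    load(id, callback) {
      if (cache.has(id)) return callback(cache.get(id));
      if (!pending.has(id)) pending.set(id, []);
      pending.get(id).push(callback);
    },
    // Fetch every pending id with a single call, then notify the callers.
    flush() {
      if (pending.size === 0) return;
      const waiting = new Map(pending);
      pending.clear();
      const values = fetchMany([...waiting.keys()]);
      for (const [id, callbacks] of waiting) {
        cache.set(id, values.get(id));
        for (const cb of callbacks) cb(values.get(id));
      }
    },
  };
}
```

The point of the sketch is that calling code only ever says «I need this value» : grouping requests and remembering answers happens in one place, instead of being hand-written at every hot spot.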

What does your decision-making process or pipeline look like? What does your decision postmortem and reversal process look like? How often do you go back on your decisions?

Article image © Barb Crawford – Flickr

I’m Going to Miss the Internet

My first dealings with the internet went through a 56k modem. I had to find and save pages to the computer to browse them offline in order to avoid the large phone bills that came after you stayed online for too long. These days, I have five computers plugged into a single fat pipe at all times, with more bandwidth than I could ever use, at one hundredth of the former cost. But still, as the internet and the computing world improved and matured, some key aspects were lost.

Browsing the internet used to be an anonymous activity. As you came online, you were awarded an IP address, which acted as your avatar in your dealings with other computers on the network. There was no way for anyone on the internet to reliably trace any kind of online activity back to your real-life existence, because there was no link between IP addresses and human beings. Even if someone did find out that you owned a given IP address, you could still argue that it had belonged to someone else when the activity took place. Sure, a handful of countries that were known for their human rights track record could play Big Brother with their citizens, but I lived in a first world country that would certainly respect my right to privacy. I was wrong. Browsing the internet in France is no longer anonymous, as internet service providers are required by law to log the owners of every single IP address they allocated. There is now a link between your IP address and your name and home address, and government agencies may follow that link to hunt you down.

I used to believe that the Internet was immune to such tampering because it was decentralized, that the RIAA and MPAA were fighting a losing uphill battle, that any attempt to restrict online freedom would be voided by technical counter-measures and workarounds. This belief was epitomized by John Gilmore in his 1993 quote:

The Net interprets censorship as damage and routes around it

This warm feeling of eternal resilience relied on a single assumption : almost every single data transfer technology can be abused to transfer illegal data (the latest Lady Gaga single, child pornography, mentions of Tian’anmen Square), and the government cannot afford to outlaw all data transfer technologies. I call this the Collateral Damage Assumption — any effective solution would involve too much collateral damage to be implemented by lawmakers. But this assumption, as self-evident as it may seem in a first world country, is incorrect.

Subtle side-effects

One reason why this assumption breaks down is that lawmakers only care about flashy, obvious side-effects. They honestly believe they can get away with subtle side-effects, so they will settle on solutions that hide away the collateral damage so that taxpayers will not notice it until it is too late. I have an actual example here, so bear with me.

A few years back, copyright owners spied on peer-to-peer networks to identify the IP addresses of illegal downloaders, traced those back to the actual names and home addresses of real-life people, sued them for infringement, and failed because there was no proof that those people were actually guilty of downloading copyrighted works, as opposed to merely being the unlucky owners of a hijacked WiFi network — it takes a few minutes and a few dollars to hack into a secured WiFi network, not to mention all those open WiFi hotspots in various restaurants and institutions.

Then, the law that became known as HADOPI was introduced. Among other things, the bill made it a misdemeanor to connect to the Internet a device that is insufficiently protected against malicious users. If a copyrighted work was downloaded from your IP address without your consent, then you failed to protect your internet connection against that malicious user and you would be sentenced for the misdemeanor. Can you swear that your home network is secure? Do you regularly change the WiFi key, keep your router firmware and operating systems up to date, and monitor your traffic for any suspicious activity? Me neither, and I suspect the average Internet connection owner does not even understand what changing a WiFi key involves.

The media and several activist groups made a fuss about the fact that the sentence carries the possibility of being barred from owning an internet connection for an entire year. That’s annoying and extreme, but certainly not the main issue.

Few recognized this law for what it was: reducing the number of false negatives (letting pirates off the hook) at the cost of having more false positives (punishing helpless, innocent people). But those false positives are a subtle side-effect: the only people who notice are those directly affected by it, and those with the technical skills to understand that securing an internet connection is hard. Outside of well-informed technical circles, the general opinion on the HADOPI remains that you will only be punished if you download copyrighted works.

And there were even subtler effects. One of them was that many pirates, aware that they were at risk of being discovered, started using encrypted file sharing protocols in order to evade detection. This significantly increased the amount of encrypted data over the network, because downloading the latest episode of The Big Bang Theory uses more bandwidth than all your HTTPS browsing and SSH terminals combined. Needless to say, the NSA was less than happy about having a lot more data to sift through when looking for terrorist threats.

While on the topic of subtle collateral damage, there is yet another example, this time in an otherwise fairly reasonable decree by our government. Around these parts, laws provide a general framework, and decrees are then used to fill in the details such as what forms should be filled, how much money must be paid, or what data is covered by “should keep the relevant information for at least one year”. In this case, the decree asked for user passwords to be kept around for at least one year, going against the fundamental principle of password security which is to never store user passwords, ever. I’m fairly certain that the people who added “and passwords” to that decree had absolutely no idea that this was an insanely bad idea, and I suspect that it would take quite some time to explain exactly why it’s such a bad idea.
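
For the technically inclined, the usual alternative to storing passwords is to store a salted hash and re-derive it at login. The following is a minimal sketch assuming Node.js and its built-in crypto module ; hashPassword and verifyPassword are hypothetical names, and the salt and key sizes are illustrative choices rather than recommendations.

```javascript
const crypto = require("crypto");

// Derive a salted hash from a password. Only the salt and the hash are
// stored: the password itself is never written anywhere.
function hashPassword(password) {
  const salt = crypto.randomBytes(16);
  const hash = crypto.scryptSync(password, salt, 32);
  return { salt: salt.toString("hex"), hash: hash.toString("hex") };
}

// Check a login attempt by re-deriving the hash with the stored salt and
// comparing the two buffers in constant time.
function verifyPassword(password, stored) {
  const salt = Buffer.from(stored.salt, "hex");
  const expected = Buffer.from(stored.hash, "hex");
  const actual = crypto.scryptSync(password, salt, 32);
  return crypto.timingSafeEqual(actual, expected);
}
```

A site built this way cannot hand over a password it was asked to «keep around», because it never had it in a recoverable form to begin with.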

General Misunderstanding

In the end, we live in a world where only a small technical elite can hope to understand the consequences of such decisions — and that is when we do agree with each other. Decisions by the unsuspecting lawmakers, unopposed by the uninformed general population, can ultimately hurt the Internet in subtle but permanent ways.

This week, the Queensland police likened receiving photos to taking stolen television sets. This is a pretty good analogy, except for the fact that 1° you cannot make a copy of a stolen television by clicking a button and 2° you do not receive thousands of television sets (stolen or otherwise) on a daily basis while browsing the web.

The easiest way to explain computing concepts to normal people is to use analogies, and all analogies are inherently flawed. Hilarity ensues when the analogy is taken to its logical but incorrect conclusion.

To make sane decisions, instate sane laws and pass sane judgements on the computing world, working by analogy is the last thing you want to do. Copyright infringement is not theft. Privacy invasion is not theft. The only acceptable way of dealing with the complex technical concepts around us is to determine their consequences in the real world, and decide based on those consequences.

What are the real-world consequences of a journalist receiving unauthorized Facebook pictures when writing an article about the security issues that allowed the pictures to be obtained in the first place? Are any of these consequences worth arresting the journalist and confiscating his property?

Being Left Behind

There’s another reason why the Collateral Damage Assumption is incorrect. We say to the computer manufacturers “let us install any software we want on our computers, or you will kill the economy” and thus we retain the right to install any software. Can you imagine the next version of Windows refusing to install any kind of peer-to-peer software? That would require some heavy restrictions on installing new software, so no one would buy it.

There was no collateral damage to Apple deciding that all applications on the iPhone must be accepted by the App Store first. They defined a new market and set their own rules, and most people accepted this situation without flinching.

We praised the Internet, and the computing world, for their versatility, for their ability to evolve around any obstacles in their path. But we assumed that this meant those features we held so dear would remain forever. This is completely wrong : the world will move away from any features that do not fit in anymore. I assumed that I would forever be able to participate anonymously on various online communities, but they are starting to use Facebook Comments because there is now a critical mass of people who 1° use Facebook and 2° don’t care about writing things in their own name on the Internet. The “mainstream Internet” has already given up on many earlier features I took for granted :

  • Browsing without cookies or JavaScript. Now, sites require these even if you do not have an account.
  • Interacting anonymously or with pseudonyms. Now, you need to use Facebook.
  • Dealing with many small tools and communities. Now, there are a handful of huge “cloud” conglomerates and communities.
  • Content placed online by competent experts. These days, anyone can create a blog to share they’re [sic] mistakes with everyone else.

As with anything that evolves, nothing is forever, not even those things that we thought the Internet could never exist without.

The Internet isn’t dying. It’s becoming something else that I’m not entirely happy with.

Monads and Asynchronous JavaScript

A minor yet interesting syntax extension for JavaScript would help solve the eternal problem of too many nested anonymous functions when writing asynchronous code. For instance, reacting to an AJAX request in standard jQuery looks like this:

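// Ask the server for the status of the element
$.getJSON('/status.php', {id : the_id}, function(data) {
  if (data.deleted) {
    // Hide over 500 ms, then remove the element once the animation is done
    $('#' + the_id).hide(500, function(){
      $(this).remove();
    });
  }
});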

The function nesting is ugly, but necessary. In a monadic writing style, it would look like this instead :

var! data = $.getJSON('/status.php', {id : the_id});
if (data.deleted) {
  do! $('#' + the_id).hide(500);
  $(this).remove();
}

This style restores the imperative look of the function, hiding away the asynchronous nature of the code in the special keywords.

In itself, the rewriting is pretty simple to perform based on two unambiguous rules :

var! a, b, c, d = expr(x,y,z);
more code

Becomes :

expr(x,y,z,function(a,b,c,d) {
  more code
});

And there is a shorthand notation:

do! expr(x,y,z);
more code

That becomes :

expr(x,y,z,function(){
  more code
});
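
The two rules are mechanical enough to be applied by a small program. Here is a rough sketch of such a rewriter in plain JavaScript ; desugar is a hypothetical helper, and it assumes that every var! or do! statement fits on a single line ending with ");" and that «more code» means all the remaining lines it is given (it does not understand enclosing blocks such as the if in the earlier example).

```javascript
// Hypothetical helper: applies the two rewriting rules to a list of source
// lines and returns the rewritten lines.
function desugar(lines, indent = "") {
  const out = [];
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i].trim();
    // Rule 1: var! a, b = expr(args);    Rule 2: do! expr(args);
    const m = /^(?:var!\s+(.+?)\s*=\s*|do!\s+)(.*)\);$/.exec(line);
    if (!m) {
      out.push(indent + line);
      continue;
    }
    // Names bound by var! become the callback parameters (none for do!).
    const params = m[1] ? m[1].split(/\s*,\s*/).join(",") : "";
    // m[2] is the call without its closing ");": append the callback as
    // the last argument, with a comma unless the call had no arguments.
    const sep = m[2].endsWith("(") ? "" : ",";
    out.push(indent + m[2] + sep + "function(" + params + ") {");
    // Everything after the statement becomes the callback body.
    out.push(...desugar(lines.slice(i + 1), indent + "  "));
    out.push(indent + "});");
    return out;
  }
  return out;
}
```

Feeding it the two lines of the first rule, desugar(["var! a, b, c, d = expr(x,y,z);", "more code"]), yields the three lines of the expected rewriting, up to whitespace.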

As a final example, here is how the “Rate Me” jQuery example would be written :

do! $(document).ready();

// generate markup
$("#rating").append("Please rate: ");

for ( var i = 1; i <= 5; i++ )
  $("#rating").append("<a href='#'>" + i + "</a> ");

{
  // add markup to container and apply click handlers to anchors
  var! e = $("#rating a").click();

  // stop normal link click
  e.preventDefault();

  // send request
  var! xml = $.post("rate.php", {rating: $(this).html()});

  // format and output result
  $("#rating").html(
    "Thanks for rating, current average: " +
    $("average", xml).text() +
    ", number of votes: " +
    $("count", xml).text()
  );
}

Doesn’t that look nicer ? I certainly look forward to something similar being included in ECMAScript.


