Tag Archive for 'Language Design'

Let’s Whine About Google Dart

What’s wrong about 17259 lines of code ? asks Andrea Giammarchi, following the publication of a gist that shows an 18-line Hello-World application written in Dart, the new Google-sponsored browser-based language, and the corresponding 17259 lines of JavaScript that the Dart source code would be compiled to.

Yes, 17259 lines of code is quite terrifying, and reading through the code does make one ponder what those engineers at Google were smoking and how they can ever expect such a language to overtake JavaScript — or even manage to be used in the first place.

And then, there’s a loud WHOOSH: the sound of a point flying so high over the collective heads of Dart critiques that it will probably collide with some garbage in low earth orbit.

The point is that this is version 0.01 of a language designed for optimization through static analysis.

I’ll be the first to admit that there is no way for a Hello-World application to execute all those 17259 lines. This is known as dead code: pieces of functionality which might be useful to some applications but not to others, which is included because it’s easier for the developer, but which will never run.

I can understand why Andrea is upset. Dead code is a big no-no in JavaScript, because it has to be downloaded and parsed by every user, which costs bandwidth and slows down everything. No, wait. Let me correct that last statement. Dead code is a big no-no in JavaScript because the JavaScript language makes it impossible to have automatic dead code elimination.

In JavaScript, you can have a global function named foobar which is not even once referenced in the entire code base, and it would still be unsafe to remove it because at some point the code fetches a funcname variable through AJAX and executes window[funcname]() — and it just so happens that the server returns "foobar" on the night of the full moon. In short, there’s no way in hell an automated tool can remove a single line of code from JavaScript without breaking at least one possible program. That’s just the way JavaScript is — objects are key-value containers with string keys.

Of course, you could decide to follow rules that prevent such things from happening. There’s at least one convention, for instance, which is used by minifiers to try and be a little more than glorified whitespace removers: variable renaming. If you have a local variable in JavaScript, you can rename it and all its occurences and the change will never be visible outside the scope it was defined in because the variable itself is not visible outside. This lets you rename aQuiteLongVariableName to g, which makes your code lighter. And it breaks because one of your function uses eval on an arbitrary string provided by the server and which references aQuiteLongVariableName. So, the convention is « do not use eval » but it’s still a convention, not a language specification, so it has to be opt-in and clearly document its requirements.

Dart was designed not to have these issues, so it does not matter if version 0.01 of the Dart-to-JavaScript compiler fails to perform dead code removal: it can be implemented, and Google engineers would indeed be fools not to implement it before the 1.0 release.

There are also questions to be asked about the actual contents of those 17259 lines — when they will be used, will it make sense?

One of the first things that I notice is, as Andrea mentioned, the various global binding functions such as:

function $bind1_0(fn, thisObj, scope) {
  return function() {
    return fn.call(thisObj, scope);
  }
}

This is actually an extremely important optimization, but not related to speed — it’s about memory. To understand why, look at this code:

var div = document.createElement("div");
var message = "Hello";
div.onclick = function(){
  alert (message)
};

The anonymous function here is a closure: it must carry the variable message with it in order to access it when it is called. In any sane language, the interpreter would store a reference to that variable in the closure, and be done with it. JavaScript cannot (again, because of eval) so it has to store the entire scope that the function was defined in, which includes both message and div. In some versions of IE, having such a circular reference between a DOM element and a JavaScript value creates a memory leak which leads to massive performance degradation, but even in better browsers than IE, the function will keep this div alive for longer than is really necessary, thus consuming memory for nothing.

Dart has the ability to perform this scope analysis and determine that only the message needs to be included. The truckload of bind functions help implement this by utterly eliminating the JavaScript closure:

function onclick(message) {
  alert(message);
}

var div = document.createElement("div");
var message = "Hello";
div.onclick = $bind1_0(onclick,div,message);

Another critic from Andrea was the massive amount of wrappers around elementary functions, such as setting an array element to a value or adding two values together, which should bring JavaScript-from-Dart programs to a screeching halt becuse of all the involved overhead.

They won’t. Trust me. This isn’t the first time I have seen such wrappers, and I know what they are and why they exist. They are generic adapters — fallback operations to be used when the static analysis step cannot determine precisely the types of the objects involved.

Quick. Translate the following code from Dart to JavaScript by hand in an optimal manner:

swap(a,i,j) {
  var t = a[i];
  a[i] = a[j];
  a[j] = t;
}

Chances are, you answered something like this:

function swap(a,i,j) {
  var t = a[i];
  a[i] = a[j];
  a[j] = t;
}

You would be wrong. You are under the mistaken impression that, because it uses the []-syntax, the Dart function works with arrays when in fact it works with any object that defines that access operator.

So, on your second attempt, you would answer something like this:

function swap(a,i,j) {
  var t = ugly_getter.apply(a,i);
  ugly_setter.apply(a,i,ugly_getter.apply(a,j));
  ugly_setter.apply(a,j,t);
}

And you would be wrong again. The key property of Dart (as well as other languages which encounter the same problem — OCaml, C#, C++, Java… anything with generic polymorphism actually) is that it allows static analysis. Static analysis lets you determine what the types of those variables might be. Is the swap function ever called on something other than an array? Then only define version one. Is it called on many things, but most often arrays? Define both functions, and use the specific array-compatible one whenever you are certain an array will be passed. Is it never called? Remove it…

This kind of analysis is reasonably simple to perform, and has the benefit of being able to create both highly-optimized code for primitive types and short generic code for abstract types. And you can even help the compiler by annotating types.

Again, it does not matter whether Dart 0.0.1 actually performs these optimizations. What matters is that they are both possible and easy given the language design, and I expect them to be present in release 1.0

I am by no means a Google zealot or secret Dart admirer. In fact, from what I have seen so far, I would sooner loathe Dart than admire it — a bastard type-checking that relies on warnings? A bland java-like syntax for brace weenies? No actual support for immutable data structures? Blech. Give me JavaScript CoffeeScript anyday. I’ll switch to Dart when it takes over the world.

But the current crapstorm is just ridiculous. Come on, the guys at Google are pushing out a 0.0.1 proof-of-concept and you’re complaining that their optimizer isn’t acceptable? Please. I’m not even sure there’s an optimizer yet. From the language specs, it’s fairly obvious that many optimization avenues unavailable to JavaScript will be available to Dart, the only question is whether release 1.0 will have them or not. Wait and see — criticizing the optimization of a 0.0.1 compiler is just silly.

Article Image © Erich Ferdinantd — Flickr

Functional Programming

Yesterday evening, I was present at Paris Hackers, a meetup for Hacker News enthusiasts — and a «hacker» in this context is not someone who breaks into software systems, but someone who is passionate beyond words about how software works. But the definition is vague, and that vagueness is part of my surprise.

We had a little chat on the topic of recent buzzwords — concepts that people were interested in, but not everyone understood. One of them was Functional Programming. The speaker asked who knew enough about it to explain to the others, and I naturally raised my hand: I was taught functional programming formally, I taught it myself, I implemented a functional programming language, and I am using functional programming professionally on a daily basis.

My surprise was that I was the only one. I had sort of assumed that a majority of «hackers» would know what functional programming is. I will have to revise that definition.

« Having Functions »

Functional programming itself has a definition problem: depending on whom you ask, you can get two different definitions on functional programming. One definition is what I would call pure functional programming, and the other is better summarized as « having functions »

Of course all languages have something they call «functions» but I strongly believe that there is a list of minimum requirements before you are allowed to say that.

  1. You should be able to define a FOOBAR anywhere. At global scope. Inside a class. Inside another function. Inside an expression. It can be an anonymous FOOBAR, or it can be given a name, depending on the situation.
  2. You should be able to store a FOOBAR in a variable, pass it as an argument to a function, and return it from a function. If your language is strongly typed, it should have an easily described type as well.

Replace FOOBAR with «integer» or «string» and you will notice that without these requirements one can hardly say a language «supports integers» or «supports strings» — what if you could not store integers in variables, or write string literals as part of expressions?

These requirements are called «having FOOBARs as first-class citizens» by the way. In C for instance, functions and arrays are second-class citizens, and integers are first-class citizens. In JavaScript, functions are a first-class citizen.

Allowing functions helps make a language a lot more expressive. Let’s think of a quick example: performing an HTTP request and displaying the result in a window. In pseudocode that looks shamefully like JavaScript code, it would look like this:

function display(url) {
  var page = http(url);
  return new Window(page);
}

This function has an obvious limitation: if it takes five seconds to load the data through HTTP, then the program will freeze for five seconds and only then display the window. A better solution would be to make the request asynchronous: a window in a «loading» state is returned immediately, and a background process takes care of the HTTP requests and fills the window with the contents once they have been received.

A clean way of encapsulating this would be to let the http function handle the asynchronous behavior. The calling code would only need to provide a description of what needs to be done once the contents have been received. There are two ways of doing this in an object-oriented world.

Object-oriented solution #1: inherit from an http request class.

class GetPageRequest extends HttpRequest {

  // We need to know what window to put the contents in
  function GetPage(window,url) {
    super(url);
    this.window = window;
  } 

  // Override this function to react to the data being received
  function onSuccess(page) {
    this.window.setContents(page);
  }
}

function display(url) {
  var window = new Window(loading);
  var request = new GetPageRequest(window,url);
  request.runAsynchronous();
  return window;
}

Object-oriented solution #2: implement an «HTTP success event handler» interface.

class OnReceivePage implements IHttpRequestSuccessHandler {

  // What window do we need to fill ?
  function OnReceivePage(window) {
    this.window = window;
  }

  // Handle success (as previous example)
  function onSuccess(page) {
    this.window.setContents(page);
  }
}

function display(url) {
  var window = new Window(loading);
  http(url, new OnReceivePage(window));
  return window;
}

Both solutions are quite long, because object-oriented solutions involve a certain amount of syntactic overhead. In both cases, the only code that is actually relevant to our work is the contents of the onSuccess function. What if the language allowed us to define that function directly, and passing it to the HTTP request without any syntactic fuss?

Functional solution #1: local function definition.

function display(url) {
  var window = new Window(loading);
  function onSuccess(page) { window.setContents(page); }
  http(url, onSuccess);
  return window;
}

In fact, why bother with giving the function a name? Its contents are obvious enough, so it would be shorter and cleaner to just passing it to the HTTP request without any name.

Functional solution #2: anonymous function definition, or «lambda»

function display(url) {
  var window = new Window(loading);
  http(url,function(page) { window.setContents(page); });
  return window;
}

The latter example is syntactically correct JavaScript code, by the way. As you can see, the code is both shorter and more readable than the class-based version. In general, using inheritance or interface implementation to define a single method is always significantly more wasteful that using an anonymous function.

There’s a third requiremend that I left unvoiced. Look back at the previous examples again. Notice how the class-based examples have to store the window object explicitly, but the functional versions use the variable directly? This is known as «closures» : when defining a function inside a function, the defined function actually carries with it all the variables that were present when it was defined. Functional languages always have closures, because any non-trivial operation you might wish to place in an anonymous function will involve accessing data that was present near the definition.

In summary: functional programming involves the ability to manipulate functions as you would manipulate any other data type (define anywhere, store, pass, return…) and allows writing more concise programs whenever you need to pass, store or return some behavior to be executed later.

Examples of functional languages:

  • JavaScript
  • Lisp (including Common Lisp, Scheme …)
  • ML (including SML, OCaml, F# …)
  • Haskell
  • Ruby
  • C#

Examples of near-functional languages:

  • PHP only supports lambdas and closures since 5.3, and closure syntax is suboptimal.
  • Python has 99% support, the only issue is that only one-line lambdas are allowed.
  • XSLT 2.0 has no anonymous functions, but is otherwise functional. I kid you not.

Examples of non-functional languages:

  • Java
  • C
  • C++

Pure Functional Programming

This is the other definition of «Functional Programming» that you can hear, and the confusion is understandable: languages that match definition #2 have long been the only examples of definition #1 as well.

The relevant concept is purity: can the impact of a function on the rest of the program be described only in terms of its return type, and can the impact of the rest of the program on the function be described only in terms of its arguments?

Consider the signature of an interface method:

public int frobnicate(Foo f);

There are many ways in which that function can affect the rest of the program:

  • It could call a global function to perform changes in the global state, such as altering a singleton, setting a global variable, printing data to the screen, sending data over a socket…
  • It could change member variables of the object it is called on.
  • It could call methods on its argument.
  • It could return a new value that is then used by the calling code.

There are also many ways in which the rest of the program could affect the behavior of the function:

  • It could depend on global variables to decide what it should do.
  • It could read global state (user input, files, sockets…)
  • It could access the data of the object it is called on.
  • It could access the data of its argument.

The entries marked in bold can be deduced from the method signature: having a return type means you want to return something, being non-static means you wish to read the state of the object you are called on, having an argument means you want to read data from that argument. The other entries are possible, but not readily apparent in the signature.

Pure functional programming is the principle of least surprise: sure, those entries that are not in bold are possible, but it would be surprising if they happened, so they are in fact not allowed. A pure function can only read data from its arguments and can only write data by returning it.

Also, pure functions imply that data is immutable — if functions are not allowed to change anything, then they are not allowed to change data structures, so the data structures are by definition immutable.

By the way, pure functional programming is all about what you cannot do — so you could in theory write functional programming in any language just by following coding conventions to that end. In practice, immutability is quite hard to respect if there is no syntax for making it easier — you spend a lot of time creating manual copies of objects in order to change a single field.

Working with pure functions and immutable data structures involves both benefits and trade-offs when compared to non-pure programing. Think of it as a slider which goes between «0% Pure» and «100% Pure» and you pick one value for your entire program. Many algorithms and concepts get easier to understand, change and manipulate as the slider moves towards the 100% end, but there are some concepts that are intrinsically non-pure and actually get harder and harder to think about as your slider moves away from the 0% end. For instance, working with a persistent data store is an intrinsically mutable endeavor, and it’s hard (but possible) to think about it in pure terms.

I find that a slider value of 95% in a language that makes such a value possible is the optimal place to be — the 5% includes writing code that is truly non-pure (such as database manipulation, responding to HTTP requests…) as well as code that is non-pure on the inside but is actually pure from the outside (such as memoization and caching).

Article Image © Tim Olson — Flickr

Monads and Asynchronous Javascript

A minor yet interesting syntax extension for Javascript would help solve the eternal problem of too many nested anonymous functions when writing asynchronous code. For instance, reacting to an AJAX request in standard jQuery looks like this:

$.getJSON('/status.php', {id : the_id}, function(data) {
  if (data.deleted) {
    $('#' + the_id).hide(500, function(){
      $(this).remove();
    });
  }
}

The function nesting is ugly, but necessary. In a monadic writing style, it would look like this instead :

var! data = $.getJSON('/status.php', {id : the_id});
if (data.deleted) {
  do! $('#' + the_id).hide(500);
  $(this).remove();
}

This style restores the imperative look of the function, hiding away the asynchronous nature of the code in the special keywords.

In itself, the rewriting is pretty simple to perform based on two unambiguous rules :

var! a, b, c, d = expr(x,y,z);
more code

Becomes :

expr(x,y,z,function(a,b,c,d) {
  more code
});

And there is a shorthand notation:

do! expr(x,y,z);
more code

That becomes :

expr(x,y,z,function(){
  more code
});

As a final example, here is how the “Rate Me” jQuery example would be written :

do! $(document).ready();

// generate markup
$("#rating").append("Please rate: ");

for ( var i = 1; i <= 5; i++ )
  $("#rating").append("<a href='#'>" + i + "</a> ");

{
  // add markup to container and apply click handlers to anchors
  var! e = $("#rating a").click();

  // stop normal link click
  e.preventDefault();

  // send request
  var! xml = $.post("rate.php", {rating: $(this).html()});

  // format and output result
  $("#rating").html(
    "Thanks for rating, current average: " +
    $("average", xml).text() +
    ", number of votes: " +
    $("count", xml).text()
  );
}

Doesn’t that look nicer ? I certainly look forward to something similar being included in ECMAScript.

Do Variant Types in OCaml Suck?

I recently stumbled once more on Tomasz Wegrzanowski’s blog, on one of the early articles he wrote in 2006 about Objective Caml — unequivocally titled Variant Types in OCaml Suck. Hence the title of this article.

Objective Caml polymorphic variants have the benefit of being created on the fly: you just write ` kitten and you have successfully created a value of type [> `kitten] (do note that you have also created a singleton type out of nothing — this can be quite useful in some cases). Of course, the ability to define types on the fly everywhere puts a strain on the type inference algorithm, who must then determine whether you are mistaken or not.

Type Inference : Damas-Milner

Type inference in ML language strives to determine the type of every value and variable in the code. It does so by giving every variable a completely general type, and then examining any constraints that apply to progressively turn it into the most general type that satisfies all constraints. Let’s try a quick example:

  • When you write fun x -> fst x + snd x, the type of x is initially 'x.
  • The type of fst is 'a * 'b -> 'a, which means that the type of x must have the shape 'a * 'b (because it’s provided as an argument to fst). So, the constraint implies that 'x = 'xa * 'xb (and that fst x is of type 'xa).
  • The type of operator + is int -> int -> int, from which the type of fst x is constrained to be 'xa = int and thus 'x = int * 'xb.
  • The type of snd is 'a * 'b -> 'b, but since 'b = int in the present context, this means that its actual type is 'a * int -> int and thus 'x = int * int

So, by applying constraints deduced from the code, the type inference algorithm computes the type of variable x. Applying a type constraint is known as unification: two types are determined to be equal, so they are superimposed on top of one another in order to construct a new unified type. This happened in the last step above: x had been determined to be of type int * 'b and this type was unified with type 'a * int. Unification is a simple pattern-matching exercise: one explores the two types as if they were trees, and…

  • If identical nodes are encountered (such as a _ * _ 2-tuple node or a leaf-less int basic type node), the unification proceeds recursively.
  • If a type variable (such as 'a or 'b) is encountered, it is declared equal to the corresponding element in the other type.
  • If two different nodes are encountered (such as a _ * _ constraint being unified with an int variable), an error happens.

Polymorphic Variant Types

Polymorphic variant types are lists of possible constructors. For instance, [`kitten | `walrus] represents a type that contains the constructors `kitten and `walrus.

When you write `kitten, the resulting type is [> `kitten]: an open variant type that can be unified with other variant types. Consider, for instance, the array [| `kitten ; `walrus |]: it is of type 'a array where 'a can be unified with both [> `kitten] and [> `walrus], which results in [> `kitten | `walrus].

Now, let’s consider the reverse operation: pattern matching. You write a simple match animal with `kitten -> "meow" | `walrus -> "bukkit": the type of `animal will be computed as [< `kitten | `walrus].

The difference between [< ..], [..] and [> ..] is obvious in terms of unification:

  • [< `a | `b | `c] can only be unified with variant types that contain less constructors than itself. So, it can be unified with [ `a | `b ] (which yields [ `a | `b ]) but not [ `a | `d ]. The reason is quite plain: if you were allowed to introduce `d into the variant type, then you could run the above match with animal = `owl and break the code. In short, [< ..] means “any others are forbidden”.
  • [> `a | `b | `c ] can only be unified with variant types that contain more constructors than itself (including other open types). So, it could be unified with [> `a | `d] (which yields [> `a | `b | `c | `d]) but not with [ `a | `b ] (which does not allow `c). In short, [> ..] means “these are allowed”.
  • [`a | `b] (a closed type) is both [< `a | `b] and [> `a | `b] at the same time. It can only be unified with variant types that allow those two but no others, and cannot be unified with any closed types other than itself.

This works, but it’s not perfect.

One major issue is the fact that every variable has a single type. So, suppose you want to transform `b into `c but leave any other constructors untouched. You would then write function `b -> `c | x -> x, but that would be a mistake: being present in the pattern matching means that x is of type [> `b] (it can accept any value, including `b) and being returned means it is unified with the type [> `c] of the other possible return value. So, the type of this function is actually ([> `b | `c] as 'a) -> 'a: even though you replaced every `b with a `c, the type inference insists that the result type can still contain a `b… In fact, there is no way to write such a function because there is no way to express the type of that function in Objective Caml.

Type Coercion

Another issue is that sometimes, you need unification between two different closed types. Consider this short example:

let string_of_vehicle : [`car | `train | `bike]       -> string = (* ... *)
let is_heavy       : [`car | `bike | `cake | `anvil ] -> bool   = (* ... *)

List.map (fun x -> if is_heavy x then string_of_vehicle x else "") [ `car ; `bike ]

This code should work, because I’m working on a car and a bike, both of which can be passed safely to both string_of_vehicle and is_heavy. However, due to external factors, the types of string_of_vehicle and is_heavy are closed and x has to unify with both. In such a situation, the solution in Objective Caml is to allow the user to make the conversion explicit by use of type coercion. In this example, it would look like this:

List.map (fun x ->
  if is_heavy (x :> [`car | `bike | `cake | `anvil ])
  then string_of_vehicle (x :> [`car | `train | `bike])
  else ""
) [ `car ; `bike ]

Coercion works by allowing a [<..] to be treated temporarily as a [..] — the solution above works by using two coercions to unify x as being of type [< `car | `bike], which is compatible with both types that it is coerced to. This is, for all purposes, merely an explicit way to convert a value from a type to a supertype (“any square is a rectangle”) that is not otherwise possible in the unification-based world of Objective Caml types — unification can only work with type equality, not with types being supertypes of other types.

Supertypes, and coercion, are intricately tied to loss of information. When you convert from [<`car | `bike] to [`car | `train | `bike], you are discarding the information that your value can not be a `train. While this loss of information might seem innocuous, it does prevent you from passing that same [<`car | `bike] for a [`car | `bike | `cake | `anvil], because the latter does not allow `train and you just discarded the information that `train cannot happen…

Since loss of information can be deadly to the soundness of a type inference algorithm, the Objective Caml design decision was that all loss of information (converting a type to a supertype) should happen explicitly (by means of a coercion).

Of course, coercion is only usually required if you have closed types in your program. Try avoiding them as much as you can, and you will find that you don’t actually need coercions all that often.

The Original Problem

Back to Tomasz Wegrzanowski’s issue: he was writing a compiler that needed to manipulate both scalars and vectors. So, he defined a value type:

type value = [`Scalar of float | `Vector of float * float * float]

He also defined specific types in order to ensure type-safety of some operations:

type scalar = [`Scalar of float ]
type vector = [`Vector of float * float * float]

type operation = [ `Muliply of scalar * scalar | `DotProduct of vector * vector ]

What he wanted to achieve was to extract all the values available within an operation:

let values : operation -> value list = function
| `Multiply (a,b) ->  [a;b]
| `DotProduct (a,b) -> [a;b]

Of course, this function is discarding precious type information about the values, which requires a type cast. The naive solution would be to individually type-cast a and b throughout the function, but this would be quite tedious. Instead, let’s strike at the root by coercing the function argument itself: by discarding information before the pattern matching occurs, we allow the code to compile with a single coercion. So, what do we coerce the argument to?

It’s not easy to tell. I guess I could create a copy of the operation type and replace all types with value, but that would be quite tedious. Instead, let’s rely on a simple property : when you need to coerce a value in order to pass it to a function, you need to coerce it to the type of that function argument. And the type of the function argument can be obtained through type-inference. Simply write this and compile it:

let values : 'a -> value list = function
| `Multiply (a,b) ->  [a;b]
| `DotProduct (a,b) -> [a;b]

let _ = values ()

The compiler determines the type of 'a through unification, knows that it’s not unit, and generates an error that describes the actual expected type. We can copy-paste it from the compiler output directly!

type general_operation = [ `Multiply of value * value | `DotProduct of value * value ]

let values x = match (a : operation :> general_operation) = with
| `Multiply (a,b) ->  [a;b]
| `DotProduct (a,b) -> [a;b]

let _ = values ()

Should the type operation change, an error will immediately happen in the definition of values (as the coercion from operation to general_operation would fail), which provides the new type that should be pasted on top of general_operation.

Avoid Side Effects

Perhaps one of the single most useful aspects of functional programming is the absence of side-effects. In a pure functional language, there is no assignment — once created, you cannot change the members of an object or the value of a variable. All you can do is create new values based on old values. Advocates of these so called immutable values argue that there are many benefits related to what you can do with them: you can keep around both the new version and the old version (implementing an “undo” feature of sorts), often at a reduced memory cost due to sharing of sub-elements between the two versions.

While I agree with these, I believe there’s something far more interesting about immutable values — what you cannot do with them, and the consequences this has on feeling secure about your code.

In my current project, I use a coding style that absolutely forbids any form of side-effects with the exception of 1° a clearly delimited set of create/update functions that affect the database, 2° any side-effects that affect local variables and 3° during initialization. These are of course exceptions — while allowed, they are avoided in the vast majority of the code.

Reordering code

One simple consequence: the order in which calls are made is only constrained by things that the compiler can detect — the function that uses x is called after the function that returns x — instead of subtle «open must be called before send» errors, simply because there’s no way for «open» to affect «send» unless the former returns a value used by the latter. So, I can rearrange lines freely because the compiler will warn me when I make a mistake.

Protecting against invalid states

A corrolary of the above is that it becomes possible to design your types so that only valid states can be represented. For instance, consider a list that must contain at least one element — in a mutable context, there is no way to represent this within the type system because I could always write this:

for (i = 0 ; i < 10 ; i++)
  list.pop();

There is no way here for the pop function to indicate that it might cause a problem, because it returns nothing — the only way out is an exception, and this happens outside the type system in most languages. On the other hand, if we keep the original list unmodified and return the new list, this opens the opportunity for indicating whether additional pops are available:

type 'a one_or_more =
  | One of 'a
  | More of 'a two_or_more

and 'a two_or_more = 'a * 'a one_or_more

let pop : 'a two_or_more -> 'a one_or_more = function (_,rest) -> rest

By design, the pop function will never attempt to pop from a one-element list because it can only be called on a two_or_more value, and it returns a value which may or may not be a two_or_more — the code needs to check this, and the compiler will detect if you forget it:

let pop_n n list =
  if n = 0 then list else pop_n (n - 1) (pop list)

Here, the type of the list variable is inferred to be two_or_more (because it’s passed as an argument to pop) which means that pop list (of type one_or_more) is not a valid argument to pop_n. An appropriate solution (popping up to n, if available) takes this into account:

let pop_n n = function
  | (One _) as keep -> keep
  | more -> if n = 0 then more else pop_n (n - 1) (pop more)

Let’s summarize what happened here: I used a different representation for one-element lists and more-than-one-element lists, which let me selectively define a pop operation only for more-than-one-element lists, and this allowed the compiler to detect when the «must contain at least one element» constraint might be violated. Have fun achieving this in your non-functional language of choice! Using a different representation in this way was made possible by the fact that pop returns a new list instead of altering its argument — so the new list can be of a different type than the old one.

This is a fundamental principle in compiled functional languages: if a given state is not valid, design your types so that it may not be represented. When this is done correctly, not only does it become harder to write code that tries to create an invalid state, but when that code is written, the compiler rejects it.

Memoization

If you are certain that a given function is pure but involves computations that you would rather not repeat too often (such as database reads), you can resort to memoization: wrap the function in a piece of code that memorizes the result of every call. In my current project, this is as simple as:

let get_pic_url = Util.memoize Picture.get_url

From then on, asking ten times for the URL of a given picture will result in a single database request. Had the function involved a side-effect, memoization would have changed the program behavior: knowledge that the function is pure lets me apply memoization without any risk of creating a bug.

Retry until successful

Yet another situation is when dealing with CouchDB transactions, which involve the following process:

  1. Fetch current version of the document from the database
  2. Modify the document
  3. Write the document back to the database
  4. If a collision happened, repeat from 1.

In short, the application must repeat steps 1-2-3 until a write succeeds. Correctness of these algorithms is difficult to test because development servers seldom have enough active users to actually cause conflicts (and when conflicts do happen, they are hard to trace). By deciding that steps 1-2-3 are handled by pure functional code, I have the guarantee that the loop can be executed any number of times and yield the expected result. Right now, a transaction looks like this:

let accept id = DB.transaction id
  begin function
  | None -> (* No object in database *) DB.Keep
  | Some post ->
    if post # accepted then
      (* Post is already accepted *) DB.Keep
    else
      (* Accept post *)
      DB.Put (object
        method accepted = true
        method author   = post # author
        method text     = post # text
      end)
  end

By applying a “never use side-effects” convention, I know without even looking at the function contents that there are no side-effects inside.

These four examples are original in that they don’t try to showcase the power of immutable values and how they can be used. Instead, they simply illustrate how a program without mutable values has interesting properties that make refactoring easier, help the compiler detect many more errors, and allows the application of patterns that can only be safely applied in the absence of side-effects. Like the monks of old, I write my software without the mundane facilities of mutable values, but I derive significant benefits out of it.

Global side-effects

Functional programming languages do not allow global state—they don’t really allow any kind of state, but at least local state can be simulated by function arguments. The only way to handle “global” state would be to pass the global variable as an additional parameter (and additional return value) to every function. And it wouldn’t play nice with exceptions either, so every exception would have to carry the values of all global variables.

Besides, global state has the annoying habit of creating hidden dependencies between program parts, which lead to coupling that is hard to break and code that is not re-entrant.

On the other hand, global variables have the clear benefit of making code shorter by making many dependencies implicit. Why pass around data as an argument or a member variable if making it global works just as well?

Some kind of middle ground here would be self-propagating implicit function arguments and return values.

Such arguments would have to be declared at global scope, since they have to be globally accessible. On the other hand, since they are merely arguments, they have no value until a function is called with one, which means the declaration would probably end up looking like:

channel total : int

This would declare an implicitly progated integer argument named “total“. Then, functions could be made to read and write to that channel implicitly as if it were a normal variable.

let rec sum = function
  | [] ->
      ()
  | head :: tail ->
      total <- total + head ;
      sum tail

Since the presence of the channel can be determined statically (it happens at name resolution time)  it can also be made a part of the function’s type signature, which has the double benefit of allowing compile-time checking of channel usage and letting the programmer know which channels must be provided for a function to work:

sum : int list [total] -> unit

The presence of a list of channels before a function means the function must be called in an environment where the channel is available. This means either the environment’s type is inferred to contain that channel, or a conflict appears and the channel must be explicitly defined:

let a = 0 in
bind a as total in
  sum [ 1 ; 2 ; 3 ] ;
  print_int a ;
  sum [ 4 ; 5 ; 6 ] ;
  print_int a

The bind instruction creates an environment where the named variable or expression is automatically updated in the local scope (as if it were applied the principles of assignment I explored last week). That is, every time a function call writes to the channel, the modification is propagated back to the bound variable as soon as the function returns (and the channel always reflects the value of the variable). So the example above would print 6 and 21.

Channel implementation is fairly straightforward : every function that accesses a channel takes an implicit final argument representing the input value of that channel (if the channel is read) and returns an implicit value (if the channel is written to). That value is then locally bound, in the calling function, to either a similar constructor that propagates the use of the channel, or an explicit bind operation for the channel.

Local side-effects

In pure functional languages, variables cannot be modified. In order to perform operations that imperative programmers achieve with variable modification, functional programmers perform re-assignment:

// Imperative
x = sqrt(x);

(* Functional *)
let x = sqrt x in ...

And in the case of a loop, they represent the loop body as a function that is called with different arguments, and the modified value is propagated through the calls as an argument:

// Imperative
x = 0
for (i = 0; i < 10; ++i)
  x += i;

(* Functional *)
let x = 0 in
let rec loop i x =
  if i = 10 then x else loop (i+1) (x+i)
in
let x = loop i x in ...

So, can local modification of variables be allowed in a pure functional context through simple rewriting rules that turn the imperative constructs into their functional counterparts? Yes, although there are some limits.

Simple assignments

We know that there might be assignments within every expression (an assignment is an expression of the form x ← <V> ; <E>). The tactic used here is to turn every expression <E> into an expression of the form:

let [E'] in x1 ← v1 ; x2 ← v2 ... ; e

such that no sub-expression within the “let …” contains an assignment (and is therefore a plain old pure functional language expression) and only variable names appear after the “in …” . This rewriting tactic is applied recursively : atomic expressions (those without sub-expressions) are trivially written as such, which leaves only the question of expressions with sub-expressions that have already been recursively turned into the above format.

The key is to turn op(<A>, <B>, <C>, <D>) into:

let [A'] in
let xa* = va* in
let [B'] in
let xb* = vb* in
let [C'] in
let xc* = vc* in
let [D'] in
let e = op(a,b,c,d) in
xa1 ← va1 ; xa2 ← va2 ; ... xb1 ← vb1 ; ... ; e

The order of the sub-expression is fixed (which means an order of evaluation has to be specified for every operation). Every expression computes all its values (including the asisgned ones) and the assignment is simulated using redefinition of the variable so that subsequent sub-expressions can “see” the modified variable. The actual assignments are then pushed to the end of the complete expression so that the recursive rewriting rule will see them from the superexpression above.

Note that expressions which do not always evaluate all sub-expressions cannot be expressed as above. Fortunately, all such expressions can be rewritten as a conditional and expressions that evaluate all sub-expressions, and conditionals are evaluated above.

Note that a lambda expression is considered to be an atomic expression here, so no propagation of assignment occurs from within the anonymous function to the surrounding context! This means (as expected) that the assignments are a purely local construct that cannot cross function barriers, so I simply remove them when they reach the top level to obtain a normal pure functional expression.

This performs the transform:

(* Imperativeish *)
let x = 3.14 in
x ← sqrt x ; print_float x

(* Functional *)
let x = 3.14 in
let v = sqrt x in
let x = v in
print_float x

Loops and conditionals

Loops, conditionals are special cases of block-based expressions. A block is a language construct that looks like a lambda expression (it gathers all values from the surrounding scope) and may be executed zero, one or several times. The main difference is that a block cannot be saved for later execution, it is always executed at a specified time. In short, a block is a beta-redex. Since we have the guarantee that the block is executed before the current context resumes, we can let it alter the state of the current context.

For every block, I select a set of variables that the block may alter (although it does not necessarily do so). The block itself is syntactically an expression, so I can rewrite its internal assignments as above by moving them all to the top level of the expression. Then, I turn the block into a closure which takes as arguments the aforementioned set of variables, and returns a pair that contains the result of the expression and the final values (after assignment) of the set of variables. In short:

(* Imperativeish *)
{
  a ← 1 ;
  b ← b + 2 ;
  a + b
}

(* Functional *)
fun (a,b,c) ->
  let a = 1 in
  let b = b + 2 in
  a + b, (a,b,c)

I can add completely unused variables (like “c” above) to the set of variables simply because another branch of the construct (usually a conditional) may use that variable as well, and I need both blocks to be functions of the same type.

Then, by transforming any blocks into functions, a conditional follows the rewriting rule :

(* Imperativeish *)
let r = if cond then A else B in ...

(* Functional *)
let r, (a,b,c) = if c then A(a,b,c) else B(a,b,c) in ...

Loops work in the same way :

(* Imperativeish *)
while c do
  A
done ; ...

(* Functional *)
let rec loop (a,b,c) =
  if c then
    let _, (a,b,c) = A (a,b,c) in
      loop (a,b,c)
  else (a,b,c)
in let (a,b,c) = loop (a,b,c) in ...

Records

All of the above only handles assignment to variables. What about assigning to records?

It is of course impossible to alter a record held by someone else. However, if the record is stored in a local variable, then it is possible to change the local variable to take this into account.

The rewriting rule is quite simple, and turns a complex assignment (assign to a record) into a less assignment recursively:

x.label ← y   becomes   x ← { x with label = y }
var           remains the same
anything else causes an error

So, this rule would perform the following transform:

(* Imperativeish *)
x.owner.details.name ← boris ; ...

(* Functional *)
let x =
  { x with owner =
    { x.owner with details =
      { x.owner.details with name = "boris"} } }
in ...

The same approach can be applied to most other assignment operations (array, string, hash table).

Semantic & Symbolic

Our brain interprets variables in two different fashions : semantic and symbolic. Semantic understanding means understanding the meaning of the variable’s name, where long and detailed names like “newCustomers” convey information in plain english and shorter names like “i” and “x” convey information through conventions. Symbolic understanding relies on recognizing the shape of the variable in several locations in code, and deducing from there its actual meaning—in itself, the name is not relevant, your brain just goes “hey, that’s the same variable“. Of course, there is some amount of semantic recognition to symbolic variables, usually because we understand the variable as being a symbol.

Mathematics make much use of symbolic recognition. After all, mathematicians do not write f(number) = 10 × number, they write f(x) = 10x and while one-letter variables do have some amount of semantics associated to them (i,j,k,m,n are integers, x,y,z are reals, p is a prime number, q and r are rationals, f,g,h are functions, t is often seen as time, d is a divisor, P is a predicate or a polynomial) this minimal amount of information is ridiculous when compared to the huge amounts of purely symbolic information one gathers from the use of the letter.

Symbolic recognition works better when the expression is small. In descending order of readability when you’re familiar with the language,

Mathematical notation:

ƒ : A → ∃n∈A. 2|n

Objective Caml:

let f(a) = List.exists (fun n -> n mod 2 = 0) a

C++:

bool f(const std::vector<int> &a) {
  for(std::vector<int>::const_iterator it = a.begin(); it != a.end(); ++it)
    if (*it % 2 == 0) return true;
  return false;
}

I guess there are two things one can learn from this.

First, in a terse language, symbolic recognition works better, which in turns means the programs can be even more terse while retaining their understandability.

Second, don’t bother with long variable names in a two-line function if all the information present in the name can be readily and easily deduced from the two lines in the function.

Value Consistency

Jeff Atwood hates PHP. So do I—despite working with it on a daily basis and leading several interesting PHP-based projects to successful completion, I do find PHP to be a disaster in terms of language design, and despite its gradual improvements over the course of its evolution.

And yet, my biggest issue with PHP is the same I have with C and C++ and, arguably, every single language in the C family with the exception of JavaScript.

Some Examples

Consider a simple get-day-name function:

function french_day_name($timestamp)
{
  return array(
    "Dimanche", "Lundi",    "Mardi",   "Mercredi",
    "Jeudi",    "Vendredi", "Samedi"
  ) [(int)date('w',$timestamp)];
}

You feed in a timestamp, and you get the day’s name in French. It relies on the fact that array() is an array literal, which is an expression (you can assign it to a variable or pass it as an argument to a function) and [] subscripting can be used to extract a certain index from an array. Except that this doesn’t work because the syntax does not allow subscripting into an array literal. Contrast this with JavaScript :

function french_day_name(timestamp)
{
  return [
    "Dimanche", "Lundi", "Mardi", "Mercredi",
    "Jeudi", "Vendredi", "Samedi"
  ] [(new Date(timestamp)).getDay()];
}

This is what I call value consistency: if your semantics allow writing array[subscript], then I expect to be able to use an array literal (or any other expression that evaluates to an array) in that expression. As a matter of fact, I don’t know what kinds of expressions support subscripting in PHP, and that makes me feel insecure so I always assign my array to a variable beforehand.

The same happens in other situations as well. For instance, don’t expect to write (new Date(timestamp)) -> getDay() in PHP: there are restrictions on what kind of expressions are allowed on the left-hand side of the object member operator, too.

Also, don’t expect to write:

$obj -> random = $good_random ? 'mt_rand' : 'rand' ;

// Later

$obj -> random(); // Sorry, not a member function

($obj -> random)(); // Sorry, bad syntax

$rand = $obj -> random();
$rand(); // Good programmer, you get a twinkie

That’s right, there are restrictions on what kind of expressions can be used to call functions as well! At least JavaScript is sane about this.

Value Consistency

For consistency, every operation that can operate on an arbitrary value should have a syntax that lets the programmer specify that value as an expression, whether that expression is a literal, an identifier, or any complex expression that ultimately evaluates to an appropriate value.

There are no circumstances under which syntax alone should prevent an expression to be used in a context where another semantically identical expression can be used.

You’re Not My Type

Consider an extremely simple language. For instance, a language that only has boolean and integer values. You could write, say, MacCarthy’s 91 function with it:

def maccarthy(n):
  if n <= 100
    maccarthy(maccarthy(n+11))
  else
    n - 10

Type-checking this language would be trivially easy, even if you don’t have any type annotations. You’d just have to check whether a given expression is used as an integer, a boolean, neither, or both (and throw an exception in the “both” case). Here, for instance, the operation “n <= 100″ would constrain “n” as an integer and the operation “n-10″ would constrain the return type as an integer, then the various type constraints would be verified for all other expressions.

Requirements and Properties

Static verification involves determining, for every static piece of code (usually, at the expression level):

  • The requirements that allow the code to run correctly. For instance, the fact that a certain value needs to be comparable, that a certain value must be of the same type as another, or that a certain value must be an object with a certain member.
  • The properties that are provided by that code. For instance, the fact that a certain expression always evaluates to an integer, that it always evaluates to an object with a certain member, or that it’s always equal to 91.

In its simplest form, the algorithm would determine the entire set of requirements and properties, and then determine whether the properties satisfy the requirements. For instance, if a variable does not have the “is an integer” property but is subject to a “must be an integer” requirement, the verification fails and indicates the error.

This is the approach used in languages with explicit type annotations, such as C. There, whenever a value is created, its properties are known (whether by the type annotation in its definition or because of the return type of the function that returned it), and whenever it is used its requirements are known (whether by the type annotations of the called function parameters, of the current function’s returned value, or because of a syntactic requirement such as the condition in an ‘if’ statement). And so, starting with the parameters and ending with the return value, the properties for every sub-expression in every function’s body are deduced and then verified.

Of course, the version used in C++ is more complex than that. For instance, the “property satisfies requirement” rule is not a simple yes-no question, but instead has several levels of priority involving implicit conversions, built-in conversions and user-defined conversions (for instance, the property “is an integer” satisfies the “must be a floating-point number“, but with a lower priority than “must be an integer“) in order to resolve overloading and template functions. But the general idea remains the same.

Type Inference

The issue with this simple approach is that it requires annotations—the code deduces properties from type annotations, so the types of all function parameters must be specified. The return type specification can be deduced in some cases (for instance, if there is only one return statement, or if all return statements return the exact same type), and the type of local variables can be deduces from the function or expression used to initialize them. So, you would be left with a “type-inferred” language where the only type annotations ever required (outside of type definitions themselves) would be for function parameters.

It’s also possible to get rid of these type annotations, but it requires an entirely different algorithm : since you don’t know what are the properties of your function parameters, how can you determine whether they satisfy the requirements within your function?

This could be done by allowing the static checker to generate properties based on the requirements that must be satisfied. Of course, that doesn’t tell us how this could be done and, in the general case, it cannot be done. Consider C++ overloading: if a function is overloaded for both integer and vector arguments, then should its argument be an integer or a vector? Introducing an “is an integer or a vector” property is impractical in most cases, and it doesn’t tell the code generator whether that variable is a one-register integer or an SSE-friendly vector—a fatal flaw.

So, full type-inference algorithms only apply in languages with type systems that allow deducing properties from requirements. One example would be the ML type system, which merges properties and requirements together in a single concept of “type”. Since in that system a requirement is a property, there’s no difficulty determining what property a requirement corresponds to.

This means that every operation is not a top-down “I deduce the result from the arguments” process, but rather a two-way “I connect the types of the result and the arguments”. The connection can be, for instance, a type equality (being equal to a certain type, or to the type of another expression), as is the case for the equality operator: if you compare two expressions, then their types are constrained to be equal to each other, which means that in the example below the type of x would be properly inferred as being an integer:

let is_zero x =
  x = 0
;;

This system works pretty well as long as you don’t have types that are too complex. However, any serious language has to handle containers (quite often, functions and objects as well), and these can cause some issues with what has been done so far.

Parametric Types

PHP’s approach to types is pretty straightforward. You have a short list of possible types:

Integer, Floating-Point, Boolean, String, Object, Array, Resource and NULL.

Every single value manipulated inside a PHP program has exactly one type among the above. This makes the language quite simple to grasp, but makes compile-time type-safety impossible to achieve: even if you know the compile-time types of a handful of objects (such as function parameters and return values), you still have no guarantees about what type of objects an array holds, or whether an object has a certain member (and if it does, what its type is), because that information is not part of the extremely simple type system in PHP.

The array problem is usually solved by adding parametric types to the language: instead of an object being an “Array“, the object instead becomes an “Array of X“. This allows the compiler to determine the type of the objects inside the array (if it’s an “Array of X“, then the type of the element is X). Most languages with a static type system (C, C++, Java, C#, OCaml, Haskell to name a few) implement arrays as parametric types.

Some languages stop there: they add only a handful of built-in parametric types (C pointers and arrays) and rely on user-defined types for handling everything else. Other languages push it a step forward and introduce generics or generic-like constructs (C++ templates, Java and C# generics). And yet other languages use parametric types to their full extent to deduce one type from another both ways (OCaml, Haskell).

The ML type system supports parametric types. In fact, because of the way it’s defined, it works because of parametric types. Every final ML type (that is, those types of values that actually end up appearing in the running program) is a formal construct, a tree where every node is a type name and the children of that node represent the value of the type parameters (types with no type parameters are leaves). For instance, an array of integer pairs is described as:

(int * int) array

Or, as a tree-like structure:

[ int ]--
         |--[ * ]---[ array ]
[ int ]--/

The type-checking system allows “unknown” types: these are types which have no constraints and are therefore universal. They are usually represented as quoted type names. For instance, an array of pairs of unknown types:

[ 'a ]--
        |--[ * ]---[ array ]
[ 'b ]--/

Or, in linear form:

('a * 'b) array

This type represents an array of pairs, but the types of the first and second elements are not constrained, so they show up as “anything”.

Type Unification

ML types work through unification. As explained above, every expression applies constraints to types. These are almost always equality constraints (explaining that two types are equal). For instance, I could decide that types (‘a * int) array and (int * ‘b) array are equal, and the algorithm would determine that both types are in fact (int * int) array by noticing the ‘a = int and int = ‘b equality constraints. This is known as unification.

The unification process is a simple tree traversal process.

  1. Unification of two trees with different roots (that is, roots that have different names) is always an error. This is why you can’t unify an ‘a list with an ‘a array: the “list” root is incompatible with the “array” root.
  2. Unification of any tree with an unknown tree replaces the unknown tree with the other. This replacement happens everywhere within the program. This also happens if both trees are unknown, of course.
  3. Unification of two known trees with the same root resolves to recursive unification of children (so child N of the first tree is unified with child N of the second tree).

In practice, this means that the unification algorithm determines if there’s a tree that’s compatible with both unified trees (otherwise, there’s an error because you have two incompatible uses of a given value) and if there is one it selects the smallest (so that there are no additional assumptions made on the type beyond those of the two original ones).

Certain expressions perform unification between two existing types. Most expressions actually construct a new intermediary type and perform unification between that new type and other types. For instance:

  • if A then B else C” unifies A and bool, then unifies B and C, then unifies the expression’s type with B.
  • F A” unifies F with ‘a -> ‘b, unifies ‘a with A, and unifies the expression’s type with ‘b.
  • A.(I)” unifies A with ‘a array, unifies I with int, and unifies the expression’s type with ‘a.


1150 feed subscribers
(readers who polled a feed this week)