Daily Archive for January 28th, 2009

You’re Not My Type

Consider an extremely simple language. For instance, a language that only has boolean and integer values. You could write, say, MacCarthy’s 91 function with it:

def maccarthy(n):
  if n <= 100
    maccarthy(maccarthy(n+11))
  else
    n - 10

Type-checking this language would be trivially easy, even if you don’t have any type annotations. You’d just have to check whether a given expression is used as an integer, a boolean, neither, or both (and throw an exception in the “both” case). Here, for instance, the operation “n <= 100″ would constrain “n” as an integer and the operation “n-10″ would constrain the return type as an integer, then the various type constraints would be verified for all other expressions.

Requirements and Properties

Static verification involves determining, for every static piece of code (usually, at the expression level):

  • The requirements that allow the code to run correctly. For instance, the fact that a certain value needs to be comparable, that a certain value must be of the same type as another, or that a certain value must be an object with a certain member.
  • The properties that are provided by that code. For instance, the fact that a certain expression always evaluates to an integer, that it always evaluates to an object with a certain member, or that it’s always equal to 91.

In its simplest form, the algorithm would determine the entire set of requirements and properties, and then determine whether the properties satisfy the requirements. For instance, if a variable does not have the “is an integer” property but is subject to a “must be an integer” requirement, the verification fails and indicates the error.

This is the approach used in languages with explicit type annotations, such as C. There, whenever a value is created, its properties are known (whether by the type annotation in its definition or because of the return type of the function that returned it), and whenever it is used its requirements are known (whether by the type annotations of the called function parameters, of the current function’s returned value, or because of a syntactic requirement such as the condition in an ‘if’ statement). And so, starting with the parameters and ending with the return value, the properties for every sub-expression in every function’s body are deduced and then verified.

Of course, the version used in C++ is more complex than that. For instance, the “property satisfies requirement” rule is not a simple yes-no question, but instead has several levels of priority involving implicit conversions, built-in conversions and user-defined conversions (for instance, the property “is an integer” satisfies the “must be a floating-point number“, but with a lower priority than “must be an integer“) in order to resolve overloading and template functions. But the general idea remains the same.

Type Inference

The issue with this simple approach is that it requires annotations—the code deduces properties from type annotations, so the types of all function parameters must be specified. The return type specification can be deduced in some cases (for instance, if there is only one return statement, or if all return statements return the exact same type), and the type of local variables can be deduces from the function or expression used to initialize them. So, you would be left with a “type-inferred” language where the only type annotations ever required (outside of type definitions themselves) would be for function parameters.

It’s also possible to get rid of these type annotations, but it requires an entirely different algorithm : since you don’t know what are the properties of your function parameters, how can you determine whether they satisfy the requirements within your function?

This could be done by allowing the static checker to generate properties based on the requirements that must be satisfied. Of course, that doesn’t tell us how this could be done and, in the general case, it cannot be done. Consider C++ overloading: if a function is overloaded for both integer and vector arguments, then should its argument be an integer or a vector? Introducing an “is an integer or a vector” property is impractical in most cases, and it doesn’t tell the code generator whether that variable is a one-register integer or an SSE-friendly vector—a fatal flaw.

So, full type-inference algorithms only apply in languages with type systems that allow deducing properties from requirements. One example would be the ML type system, which merges properties and requirements together in a single concept of “type”. Since in that system a requirement is a property, there’s no difficulty determining what property a requirement corresponds to.

This means that every operation is not a top-down “I deduce the result from the arguments” process, but rather a two-way “I connect the types of the result and the arguments”. The connection can be, for instance, a type equality (being equal to a certain type, or to the type of another expression), as is the case for the equality operator: if you compare two expressions, then their types are constrained to be equal to each other, which means that in the example below the type of x would be properly inferred as being an integer:

let is_zero x =
  x = 0
;;

This system works pretty well as long as you don’t have types that are too complex. However, any serious language has to handle containers (quite often, functions and objects as well), and these can cause some issues with what has been done so far.

Parametric Types

PHP’s approach to types is pretty straightforward. You have a short list of possible types:

Integer, Floating-Point, Boolean, String, Object, Array, Resource and NULL.

Every single value manipulated inside a PHP program has exactly one type among the above. This makes the language quite simple to grasp, but makes compile-time type-safety impossible to achieve: even if you know the compile-time types of a handful of objects (such as function parameters and return values), you still have no guarantees about what type of objects an array holds, or whether an object has a certain member (and if it does, what its type is), because that information is not part of the extremely simple type system in PHP.

The array problem is usually solved by adding parametric types to the language: instead of an object being an “Array“, the object instead becomes an “Array of X“. This allows the compiler to determine the type of the objects inside the array (if it’s an “Array of X“, then the type of the element is X). Most languages with a static type system (C, C++, Java, C#, OCaml, Haskell to name a few) implement arrays as parametric types.

Some languages stop there: they add only a handful of built-in parametric types (C pointers and arrays) and rely on user-defined types for handling everything else. Other languages push it a step forward and introduce generics or generic-like constructs (C++ templates, Java and C# generics). And yet other languages use parametric types to their full extent to deduce one type from another both ways (OCaml, Haskell).

The ML type system supports parametric types. In fact, because of the way it’s defined, it works because of parametric types. Every final ML type (that is, those types of values that actually end up appearing in the running program) is a formal construct, a tree where every node is a type name and the children of that node represent the value of the type parameters (types with no type parameters are leaves). For instance, an array of integer pairs is described as:

(int * int) array

Or, as a tree-like structure:

[ int ]--
         |--[ * ]---[ array ]
[ int ]--/

The type-checking system allows “unknown” types: these are types which have no constraints and are therefore universal. They are usually represented as quoted type names. For instance, an array of pairs of unknown types:

[ 'a ]--
        |--[ * ]---[ array ]
[ 'b ]--/

Or, in linear form:

('a * 'b) array

This type represents an array of pairs, but the types of the first and second elements are not constrained, so they show up as “anything”.

Type Unification

ML types work through unification. As explained above, every expression applies constraints to types. These are almost always equality constraints (explaining that two types are equal). For instance, I could decide that types (’a * int) array and (int * ‘b) array are equal, and the algorithm would determine that both types are in fact (int * int) array by noticing the ‘a = int and int = ‘b equality constraints. This is known as unification.

The unification process is a simple tree traversal process.

  1. Unification of two trees with different roots (that is, roots that have different names) is always an error. This is why you can’t unify an ‘a list with an ‘a array: the “list” root is incompatible with the “array” root.
  2. Unification of any tree with an unknown tree replaces the unknown tree with the other. This replacement happens everywhere within the program. This also happens if both trees are unknown, of course.
  3. Unification of two known trees with the same root resolves to recursive unification of children (so child N of the first tree is unified with child N of the second tree).

In practice, this means that the unification algorithm determines if there’s a tree that’s compatible with both unified trees (otherwise, there’s an error because you have two incompatible uses of a given value) and if there is one it selects the smallest (so that there are no additional assumptions made on the type beyond those of the two original ones).

Certain expressions perform unification between two existing types. Most expressions actually construct a new intermediary type and perform unification between that new type and other types. For instance:

  • if A then B else C” unifies A and bool, then unifies B and C, then unifies the expression’s type with B.
  • F A” unifies F with ‘a -> ‘b, unifies ‘a with A, and unifies the expression’s type with ‘b.
  • A.(I)” unifies A with ‘a array, unifies I with int, and unifies the expression’s type with ‘a.


693 feed subscribers
(readers who polled a feed this week)