Monthly Archive for August, 2008

It Will Fail

These are a few bits of wisdom I have gathered over time, related to software development. These are related to the fact that when left in the hands of humans, things always go wrong.

A best practice is only as good as its weakest link, which in most cases is the enforcement of that practice. A bad developer might go against best practices out of laziness or perhaps misunderstanding of the benefits, while a good developer might be in a rush and write a quick hack that he intends to correct after a heavy deadline, or simply forget to perform a specific part of the iteration process.

Every practice which relies merely on human enforcement is bound to be broken. Adding people in charge of verification decreases the probability of this happening, but increases the latency of the development cycle as software has to pass several layers of checks.

It will fail.

If you can imagine a situation in which a piece of software could fail, then that situation will inevitably arise, either in development or in a production environment, over the lifetime of the software. It’s hard enough to deal with unforeseen failure causes to let the foreseen causes alone.

If you don’t fix it, mention it.

Of course, not all problems are a priority. If a deadline looms and a possible error appears in a non-essential piece of code, correcting that possibility is definitely not a good idea. However, the knowledge of a possible failure should never be thrown away.

Always mention a possible failure cause in the code. Most IDEs provide the functionality to extract comments prefixed with //TODO in order to have a list of unsolved cases. Add the information where it would be most obvious to see to someone who is debugging the issue later on : right above the statement that would fail an assertion or throw an exception (if nothing would fail, then add an assertion of your own, as these disappear when optimizations are turned on) or, failing that, in the documentation of the function that would return spurious values.

It is also very useful to log the issue in the bug tracking software used for the project, allowing the other developers to manipulate the issue with the appropriate tools.

Redundant things aren’t.

If the design of your software requires that a given piece of information be identically present in two different locations, then the two locations will inevitably go out of sync.

While this is usually taken to apply to software managing synchronized data distribution over several threads or even computers, it has a much more immediate practical information when applied to humans handling the data replication by hand. For instance, for a user to connect to a private server, he must use the password that the server expects. Out-of-sync means here that changing the server password will lock users out. Worse, the written down copy of the password that the administrator keeps in a safe locker will also go out of sync sooner or later.

Always plan for synchronization loss : since it will inevitably happen, regardless of how much effort is made, what steps can be taken to save the day when it does happen? One possibility is to document the synchronization requirement, so that people encountering the error will know that they have to duplicate some data. Also, if configuration files are accessible and modifiable by scripts, automating the data synchronization is an interesting possibility.

People forget.

The more complex the usage of a given piece of functionality, the fewer people can use that functionality. Besides, if the functionality is not used often enough, it fades from the skill set of those who had been using it, until nobody knows how it works exactly.

As such, never let your processes rely on undocumented software. If the solution to an issue was not found in the documentation of the software, store it in a knowledge base (it could be an online wiki site, or something more complex). When a bug in a third party tool is worked around, also write down the solution.

Similarly, always keep the code well-documented, and keep architectural descriptions. Use conventions used by all developers: add a README file to your project, and store the documentation in a folder with the right name at the project root. Part of a good documentation is being able to easily find the documentation for a given element.

Serializing to PHP Code

For performance purposes, it can sometimes be interesting to have PHP code automatically generate other pieces of PHP code. In order to insert non-constant data into those pieces of code, the most elementary solution is to serialize everything as a string, and the unserialize them on the fly. While slow, this has the advantage of being very simple.

However, it’s possible to provide a function that handles all the special cases: numbers are printed as numbers, strings as correctly quoted and escaped strings, arrays are defined in the correct order and with the correct keys. Only objects (which may have specific sleep/wakeup code to be executed to allow them to remain alive) are actually saved.

function tophp($x)
{
  if (is_numeric($x))
    return "$x"; 

  if (is_string($x))
    return "'" . addcslashes($x, '\\\'') . "'"; 

  if (is_array($x))
  {
    foreach($x as $key => $id) $txt .= tophp($key) . '=>' . tophp($id) . ',';
    return 'array(' . $txt . ')';
  } 

  if (is_object($x))
    return "unserialize('" . addcslashes(serialize($x), '\\\'') . "')"; 

  if (is_bool($x))
    return $x ? 'true' : 'false'; 

  if (!isset($x))
    return 'null'; 

  assert (false);
}
I hereby declare this code to be in the public domain, where allowed by law.

Compile-time Syntax Validation

The premises

In the last installment of Functional Tuesdays, it was mentioned that Objective Caml could be used to write a PHP view that was verified by the generator program as creating valid XHTML. While many tasks will have to be programmed directly in Objective Caml to perform the validation, some of the more elementary parts of validation can be done directly in the type system of the language.

Omitting a lot of details (attributes, unwanted tags, self-closing tags, comments…), the grammar of XHTML looks like:

text : any piece of text with quoted entities
p    : <p>(text|span)*</p>
span : <span>(text|span)*</span>
div  : <div>(p|span|div|ul|ol|text)*</div>
li   : <li>(p|span|div|ul|ol|text)*</li>
ul   : <ul>li*</ul>
ol   : <ol>li*</ol>
body : <body>(p|div|ul|ol)*</body>
html : <html>body</html> 

It’s already possible to write one function for each of these non-terminals, and this has already been done. Formatting the things a little bit better:

module TagImpl = struct
  let tag n i = Printf.sprintf "<%s>%s</%s>" n (String.concat "" i) n 

  let p i = tag "p" i
  and li i = tag "li" i
  and ul i = tag "ul" i
  and ol i = tag "ol" i
  and div i = tag "div" i
  and span i = tag "span" i
  and body i = tag "body" i
  and html i = tag "html" i
  and text i = i 

  type 'a block = string
  let to_string s = s
end

Module types

The main selling point of Objective Caml as a functional language is its excellent module and type system. In this case, the above module is type-unsafe (in the area of generating a valid XHTML document) because it manipulates all nodes as strings. This makes it possible to create a document such as <html><p>Hello!</p></html> (with html [p [text "Hello!"]]). To make this module type-safe, one can apply a module type, thereby restraining the signature of the functions in the module to have a specific type. This is the purpose of the two additional definitions of ‘a block (all functions will take as arguments lists of blocks and return blocks) and to_string (to return a string representing a block).

The type parameter of the block type serves to convey information about what the block is. Since Objective Caml supports variant types, it’s possible to use a variant type as that type parameter, with a distinct variant label representing each of the tags. For instance, a <p> tag will be represented by the label `P, and any functions which accept a paragraph tag as their internal XHTML code will take `P as a possible label.

The result is this module type:

module type TAG = sig
  type 'a block
  val text : string -> [>`TEXT] block
  val p    : [<`TEXT|`SPAN]                  block list -> [>`P] block
  val li   : [<`SPAN|`TEXT|`UL|`OL|`P |`DIV] block list -> [>`LI] block
  val ul   : [<`LI]                          block list -> [>`UL] block
  val ol   : [<`LI]                          block list -> [>`OL] block
  val div  : [<`SPAN|`TEXT|`UL|`OL|`P |`DIV] block list -> [>`DIV] block
  val span : [<`TEXT|`SPAN]                  block list -> [>`SPAN] block
  val body : [<`DIV|`P|`UL|`LI]              block list -> [>`BODY] block
  val html : [<`BODY]                        block list -> [>`HTML] block
  val to_string : 'a block -> string
end

The type of the blocks accepted in the “inner XHTML” list of each tag is restricted to be one of the possible labels, while the returned block type is constrained to contain at least the type of block created by that tag. The returned block is not exactly the type of created block because this would prevent placing the returned blocks in lists of distinct blocks (such as a block containing both text and spans).

Error detection

The basic errors are detected by this type system:

module Tag = (TagImpl : TAG);;
open Tag;; 

let _ = html [text "Bad!"];;
(*            ^^^^^^^^^^^
This expression has type [> `TEXT ] Tag.block but is here used with type [ `BODY ] Tag.block
*)
let _ = p [ ul [] ];;
(*          ^^^^^
This expression has type [> `UL ] Tag.block but is here used with type [< `SPAN | `TEXT ] Tag.block
*)
let _ = ul [ div [] ];;
(*           ^^^^^^
This expression has type [> `DIV ] Tag.block but is here used with type [ `LI ] Tag.block
*)

Additionally, the errors are fairly clear: the current type of the block being insertedis shown first, followed by the type of blocks that are allowed in that position.

As an advanced feature, this also plays well with the Objective Caml type system, allowing the user to define functions which create subexpressions from other subexpressions and still having everything type-safe. For instance, a function which takes a list of blocks and turns it into an XHTML list with one block per <li></li>.

let decorate elts = ul (List.map (fun x -> li [x]) elts);;
(* val decorate :
    [< `DIV | `P | `SPAN | `TEXT ] Tag.block list -> [> `UL ] Tag.block = <fun>
*)
let _ = decorate [text "Hello"; div [text "world!"]; body []];;
(*                                                   ^^^^^^^
This expression has type [> `BODY ] Tag.block but is here used with type
  [< `DIV | `P | `SPAN | `TEXT > `DIV `TEXT ] Tag.block
*)

The error message is slightly less clear here because of the > `DIV `TEXT section (indicating that the list already contains div tags and text sections). Another example would be a function which decides whether to wrap some text in a <p> or a <div>, the output of such a function cannot be used safely in a <p>:

let choose elts =
  if Random.int 2 = 1 then p elts else div elts;;
(* val choose : [< `SPAN | `TEXT ] Tag.block list -> 
                [> `DIV | `P ] Tag.block = <fun>
*)
let _ = choose [div []];;
(*              ^^^^^^
This expression has type [> `DIV ] Tag.block but is here used with type
  [< `SPAN | `TEXT ] Tag.block
*)
let _ = p [choose []];;
(*         ^^^^^^^^^
This expression has type [> `DIV | `P ] Tag.block but is here used with type
  [< `SPAN | `TEXT ] Tag.block
*)

Even better, it can detect ahead of time that some functions cannot be used, for instance a function which decides whether to wrap in <p> (accepts only text or spans) or in <ul> (accepts only list elements):

let impossible elts =
  if Random.int 2 = 1 then ul elts else p elts;;
(*                                        ^^^^
This expression has type [ `LI ] Tag.block list but is here 
  used with type [< `SPAN | `TEXT ] Tag.block list
These two variant types have no intersection
*)

Further work

Now that the XHTML tree itself is valid, it is interesting to consider inline PHP code as being typed blocks (based on the type they can output), which in turn leads to safe generation of a valid document witin the PHP code.

C++ Warts

What warts are

Warts, in programming jargon, refer to any information appended to a symbol’s name in addition to the core semantic name of that symbol, usually in a highly abbreviated and standardized fashion. Examples of wart-bearing names (with the warts in red) would be lpszName, CUserSocket, g_project_config, author_t, and next_. In themselves, warts are just a way of conveying additional information about a symbol, usually a property such as its kind (class, template, interface), type (long pointer to a string), scope (global or member) or some more specific information (encoding, being a row or column index, and so on).

Warts fall into three major categories:

  • Useful, these convey a relevant information that would not be easily conveyed otherwise.
  • Useless, these convey some information that can be readily gathered from the name or the usage.
  • Useless and dangerous, these convey some information with some negative side-effects on code quality.

Those that are Useful

To be useful, a wart must be small enough to avoid cluttering the name (most people consider the clutter limit to be at most one character long), summarize a piece of information that is not obvious from the context of use (and cannot be made obvious by using the language), and provide a large enough level of understanding that it can severely influence the plans of the programmer.

The canonical example of an useful wart is the “interface” wart, which consists in prepending an uppercase i (for interface) to the name of an interface (or, in C++, an abstract class which serves as an interface). The message conveyed is that the symbol represents a type which provides no functionality, and merely defines a list of methods that must be provided by any subtype. There is a big difference between a function with a signature such as:

void Print(const Formatter &fmt)

And a function with a signature like:

void Print(const IFormatter &fmt)

The first function brings up the idea that an instance of the codebase-provided Formatter class must be created, customized either through constructor arguments or through member functions, and then passed to the function. The second function brings up the idea that one must instead use an existing implementation of the formatter, or perhaps create a new one. Of course, the wart somewhat loses its usefulness when inside code, because an intellisense-like tool can easily convey the same information. However, in documentation summaries or samples, or when working on the white board, these are useful.

Those that are useless, but benign

The leading contenders here would be the C prefix (indicating that a type is a class), the T prefix (indicating that a symbol is a template), the _t suffix (indicating that a symbol is a type), the g_ and m_ prefixes (indicating respectively that a symbol is a global or member variable). The latter is sometimes found as a trailing underscore.

All of this information can either be clearly inferred from the context in any reasonably well-written program (not only should all programs be well-written, but programs which use warts generally use them in order to write the program well), or provide no actual relevant information.

The starting point would be the C prefix. Indicating that a symbol is a class type can serve two purposes: differentiating it from non-class types, and differentiating it from non-types. The first distinction is pointless: the only situation when a class behaves differently from another type in a way that actually matters is when looking at PODs (and even then, classes can be PODs as well) and code that relies on that distinction is extremely rare.

As for labeling types to differentiate them from variables, this is a good idea in C, where no C++ namespaces exist and thus collisions must be avoided between type names and variable names. In C++, however, most serious projects place their types into namespaces, and the global namespace can be used as a last resort if necessary. Furthermore, even though it can be easily solved, ambiguity between type names and other symbols arises only with a high rarity anyway.

The T prefix to templates may appear as a good idea in the same way as the I prefix. However, a template is nothing more than an ad hoc meta-function which uses ‘<>’ instead of ‘()’ for its parameter list. As such, there is no more need to differentiate templates from non-templates than it is to differentiate functions from non-functions: the template parameter list is almost always present on a template (removing any ambiguity), and the only case where it is not present is when passing the template as an argument to another template, a situationboth too rare and too complex to be resolved without knowing what the other template is doing.

The g_ prefix for global variables does provide information that is relevant and important. However, it can be easily gathered from the context in a well-written program, simply by namespacing it: a namespaced variable is always obviously distinct from a local or member variable. The advantage of adding the namespace prefix is that the ownership of the global variable is more clear (since the namespace will reflect which part of the program the global belongs to, if any), and the prefix can be ignored easily in a context where it is obvious that a global variable is being manipulated. Also, using global variables in C++ is a typical recipe for disaster (in particular because of order-of-initialization problems) and is usually replaced by a global function returning a reference to an internally managed global object instead, making this point more of a historical anecdote.

The m_ prefix (or underscore suffix) for member variables is also useless: the vast majority of uses of a member variable happens in a context where the nature is a syntactical fact (such as object.member, object -> member or constructor() : member(data)) where a wart is entirely useless. The situation where ambiguity can arise is when accessing a member variable in a member function. Two possibilities exist here: either the function is small enough that the variable, being neither a local variable nor a function parameter, is necessarily a member, or the function is large enough to warrant an effort. This can be done by prepending a this -> to the variable name, which serves the same purpose as a wart, but has the advantage of being removable when unnecessary.

Those that are useless and dangerous

Worse that being useless, is being dangerous. Most useless warts carry redundant (and thus useless) information by using up only one or two characters. Some warts, however, can carry incorrect information, use up too much space, or even prevent code from being portable.

Correct information appears mostly with warts that carry type information. During the lifetime of a program, it’s possible for the type of variables to change to accomodate program extensions or bug corrections. Expecting the programmer doing the modification to also propagate the name change is a good idea in theory, but hopeless in practice. Inevitably, out of laziness, honest mistake or looming deadlines, the type wart will become obsolete, and will lead a maintenance programmer into thinking that iLength is an integer when in fact it is a floating-point number.

Using too much space becomes a real difficulty with type warts: either class instances get useless warts such as ‘o’ or ‘obj’ to indicate that they are objects (which is a quite pointless fact in object-oriented software), or they get more detailed ones. This is to say nothing of the typical long-pointer warts so widespread in the Win32 C API (meet the lplptstrUsername internationalization-friendly pointer to an username). While in itself just a readability issue of making variable names longer, these warts often lead the programmers to shorten the non-wart part of the name to keep line lengths manageable, leading to lplptstrUsrnm or worse.

Last but not least, there is the case of the prefix underscore: when an identifier begins with an underscore followed by a capital letter, that identifier is reserved by the implementation. This means that code can break when changing compilers or compiler versions.

Conclusion

While warts are not always necessarily bad, many are completely useless, requiring additional uniformization effort for no real gain besides, perhaps a very subjective aesthetic pleasure and a warm and fuzzy feeling for the instigator, and some are actually dangerous to use despite their initial appeal, because information redundancy, when maintained by humans, is seldom redundant in the face of changes.

Persistent PHP Closures

As a sequel to the previous installment of Dynamic Wednesdays, this article considers the next step in closure manipulation in PHP: persistence. Persistence is the process of preserving a part of program state from one execution to another. In the execution model of PHP, scripts are executed independently (and concurrently) and their working memory is flushed from memory when they end. The only preserved data is that which is sent back to the HTTP client, and that which is saved in the database.

This makes several typical design patterns quite impractical for any usage that extends beyond a single HTTP request: for instance, an API designer may wish the users of the API to be notified of certain events (for instance, the modification of a piece of data). In a typical non-transactional application, the user modules would register themselves with the API by use of the Observer design pattern (or, in more functional terms, a callback) which would remain stored in the API until the application shuts down, and the observer would be notified of the relevant events. This kind of behavior is, in terms of functionality, perfectly supported by PHP.

However, for this to work, every HTTP request must execute in a completely initialized environment. This requires the server to create a new instance of every core object, then load and initialize all objects provided by the third party users (thereby setting up the aggregation links required by the Observer design pattern through registration with the core objects), and finally run the HTTP request which, most of the time, will not trigger the observer at all. In short, using these techniques in PHP requires massive and mostly useless initialization times that would be best done without.

Partial workarounds do exist: after all, the only requirement to avoid initialization costs is to persist the inter-object relationships in the database. For instance, if a certain user wants to be notified when a certain product is back in stock, his user identifier would be associated with the product identifier in a “waiting for stock” table in a relational database. Observing that a user was associated to the product, the server would then load the user-related code, seek the user by its identifier, and proceed to send the notification. This partial solution, however, is incomplete, since it does not support the Open-Closed principle. Since neither the source code to be loaded (user-related, for instance) nor the relational database table allow for polymorphism, a third party developer cannot extend the product notification system to notify something other than a user.

Using a database

The above observations lead to a simple conclusion: to achieve robust persistence of closures, it is necessary to allow polymorphic behavior in terms of source file to be added, and in terms of code to be executed. This implies that the database will have to store both a list of source files to be loaded from disk when the persisted closure is called, and a description of what code should be called. This means either providing a single function name, or a piece of PHP code accompanied by a serialized list of the closure contents.

Despite the obvious danger of storing executable PHP code directly in a database, I will suggest going with that option as a preliminary step, possibly using a specific user for modifying closures, and giving all other users only the right to SELECT and DELETE from the closures table. SQL pseudocode for the database would be as follows:

CREATE TABLE `closures` (
    `id` INT NOT NULL AUTO_INCREMENT,
    `exec` TEXT NOT NULL, -- PHP code to be run.
    PRIMARY KEY(`id`));
-- To insert a new closure and get the coresponding ID:
INSERT INTO `closures` (`exec`) VALUES ('@exec')
-- To extract a closure's data
SELECT `exec` FROM `closures` WHERE `id` = @id
-- To drop an unused closure
DELETE FROM `closures` WHERE `id` = @id

Creating a closure involves four arguments: the list of source files to include to be able to run the closure (this will be prepended to the PHP code stored in the database), the formal parameter, the source code to be executed, and the list of arguments stored in the closure (this will be serialized by value, so no references are allowed here). Again, in a pseudocode fashion:

function create_persistent_closure($include, $args, $code, $data) {
  foreach($include as $file) 
    $src .= "require_once('$file');"
  foreach(array_values($args) as $id => $arg) 
    $src .= sprintf('$%s = function_get_arg(%d);', $arg, $id);
  foreach($data as $var => $value)
    $src .= sprintf('$%s = unserialize(\'%s\');', $var,
                    addcslashes(serialize($value), '\\\''));
  return add_to_database($src . $code);
} 

function get_persistent_closure($id) {
  return create_function('', get_from_database($id));
}

The return value of get_persistent_closure is a callable function which will execute the stored code. Note that due to the limitations of serialization in PHP, the list of arguments stored in the closure will generally be a set of identifiers and indices used to retrieve the actual objects to be manipulated (from factories and singletons). For instance:

// When registering an observer (first HTTP request)
$closure = create_persistent_closure(
    array( 'core/models/user.php' ), 
    '$product_id', 
    'UserFactory::Get($user_id)->SendProductNotification($product_id);', 
    array('user_id' => $user_id)); 

Store::Get($store_id)->AddObserver($closure); 

// When notifying observers (second HTTP request)
foreach ($this->closures as $id) {
    $func = get_persistent_closure($id);
    $func($product_id);
    drop_from_database($func); 
} 

$this->closures = array();

Using code generation

The above example, while full of good ideas, is insufficient. The main reason is, of course, safety: anyone with write access to the appropriate database table can cause arbitrary code to be executed, and this is easy if an SQL injection vulnerability exists. By contrast, storing the source code as files on disk is safer: write access to the disk already bears the risk of executing arbitrary code simply by uploading it. By restricting PHP-implemented uploads to a specific directory that is kept separate from the persistent closure cache, one can ensure the safety of the code.

By using source files serialized to disk, however, one runs the risk of multiple access to the same file. Therefore, it is advised that the file is manipulated using locking primitives to avoid those issues.

I will not provide the complete implementation here. However, the basic idea behind this implementation is to provide a directory containing the persistent closures. The closures are saved to relative paths within that directory, of the form ‘module/params/reason’, for instance ’store/1013/onProductRestock’. The closures are stored inside the event handler that will be using them, thereby improving performance when several closures are used by the same handler (thus avoiding multiple loads).

The typical usage would be:

// When registering an observer (first HTTP request)   
Store::Get($store_id) -> GetHandler() -> Register(
    array( 'core/models/user.php' ),  
    '$product_id',  
    'UserFactory::Get($user_id)->SendProductNotification($product_id);',  
    array('user_id' => $user_id));  

// When notifying observers (second HTTP request)  
$handler = $this -> GetHandler();
$handler -> Call($product_id);
$handler -> Clear();

// Inside the GetHandler function
return new EventHandler("store/" . $this->store_id . "/onProductRestock");

Generating a PHP View

Creating PHP Views

In the Zend Framework (a recent and high-quality framework for PHP 5), a view is a way of specifying how the data associated to the underlying data model should be represented, in terms of HTML code. An example of view file would be the following:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> 
<html>
  <head>
    <title><?=i18n('website title')?></title>
     <?foreach($this->meta as $equiv => $value){?>
       <meta http-equiv="<?=$equiv?>" value="<?=$value?>"/>
    <?}?>
  </head>
  <body>
    <div class="navig"><?display($this->menu);?></div>
    <div class="content"><?display($this->content);?></div>
    <div class="footer"><?display($this->footer);?></div>
  </body>
</html>

I won’t go into the details of explaining why this is a good idea (that is a subject for a Dynamic Wednesdays article, after all), but the overall idea is that it separates the display information (the HTML layout) from the data itself (the various member variables of the $this object, meaning that several layouts can be used for the same object and present things differently.

Why, however, does this belong in Functional Tuesdays? What isn’t obvious above is that this view was (aside for the beautification of adding colors, line breaks and indentation) generated by a piece of Objective Caml code, illustrated below:

open Html
let doc = document
  ~head:[
    title (i18n "website title");
    foreach "$this->meta" "$equiv" "$value" [ meta (var "$equiv") (var "$value") ]]
  ~body:[
    div ~a:["class", "navig"] [ show "$this->menu" ];
    div ~a:["class", "content"] [ show "$this->content" ];
    div ~a:["class", "footer"] [ show "$this->footer" ]]

An HTML generator generator

The underlying idea is very simple: one uses the Html module to write a description of an HTML document, which may include inline PHP code such as that used by a view. Then, the program is compiled and run, and outputs the described document, which may be uploaded to an appropriate PHP web server.

What is the point?

  • Elementary safety: unlike PHP, the Objective Caml is statically compiled. This means that a lot of verifications (such as checking for existing functions and variables) can be done automatically to detect such errors. By contrast, one would have to test all the possible paths in a view template to determine that no error happens.
  • Additional verifications: although those are not readily done by the OCaml type system, it’s possible to perform static checks on the generated code. This includes generated XHTML code validation (examining the generated XHTML tree against a provided DTD), automatic string literal escaping, and even some analysis of the inline PHP code.
  • Reflection: it’s possible to extract a simplified tree that displays all the arguments that were provided to show, “class” or i18n, allowing the automatic construction of a piece of documentation (required $this fields), CSS wishlist (used classes) and translation tables (used internationalization labels).
  • Factoring: since OCaml has a much more expressive set of semantics than a typical view template, it becomes easier to define frequently recurring subexpressions (such as a mandatory wrapper around a block) into a simple function.
  • Automatic generation: if a compiler is using PHP as a target language, then this module can be used as a generator.

Implementation fundamentals

How is this implemented? The simplest way is simply to have functions return a string representing the HTML they generate. Later on, this return type could change to a product type containing both the generated string and various other information for extraction of verification purposes. The fundamental functions used here are:

let flatten xl = concat "" xl
let attrs l    = flatten (map (fun (n,v) -> " "^n^"=\""^v^"\"") l) 

let tag t a i  = sprintf "<%s%s>\n%s\n</%s>\n" t (attrs a) (flatten i) t
let ctag t a   = sprintf "<%s%s/>\n" t (attrs a)
let ptag c     = sprintf "<?%s?>\n" c

These functions allow the creation of well-formed tags. Since one wishes to allow the usual “a tag contains a list of tags” approach frequently in XML, a good idea is to use a flattening function to turn a list of strings into a single string, and consider the contents of a tag to be a list of strings. Attributes are handled as name/value pairs and accordingly printed out where applicable.

Once these functions are provided, it’s easy to write helpers for elementary XHTML tags:

let div        ?(a = []) x = tag "div" a x
let ul         ?(a = []) x = tag "ul" a x
let ol         ?(a = []) x = tag "ol" a x
let span       ?(a = []) x = tag "span" a x
let li         ?(a = []) x = tag "li" a x
let p          ?(a = []) x = tag "p" a x
let blockquote ?(a = []) x = tag "blockquote" a x
let a          ?(a = []) x = tag "a" a x 

let title x  =  tag "title" [] [x]
let meta e c = ctag "meta" ["http-equiv",e; "content",c]
let i18n s   = ptag ("=i18n('"^s^"')")

Here, optional arguments are used to provide attribute lists when necessary, without forcing them to be used if not necessary. The same can be done for the classic PHP control statements:

let var e = ptag ("="^e)
let run e = ptag (e^";")
let show o = ptag ("display("^o^");")
let ifthen c i = ptag ("if("^c^"){") ^ flatten i ^ ptag "}"
let ifelse c i e = ptag ("if("^c^"){") ^ flatten i ^ ptag "}else{" ^ flatten e ^ ptag "}"
let foreach s k v e =
  ptag (sprintf "foreach(%s as %s => %s){" s k v) ^ flatten e ^ ptag "}"

For these two code sets, OCaml will check the syntax of the generator program, which will implicitly test the syntax of the generated program for validity (as far as brace and tag balancing are involved). Of course, variable existence or initialization issues are not necessarily solved (this will be done in future situations).

Ordinal Serialization

Serialization

Serialization is the process of turning an in-memory non-linear data structure into a sequence of bytes that can be used, at another place or another time, to construct the original in-memory data structure. It’s the equivalent of putting the data structure in a box before moving to a new house (or to keep it in the basement for later use).

Today, serialization is a solved problem. Most programming languages provide serialization as a standard library feature (PHP’s serialize, OCaml’s Marshal module) or as a common extension (C++’s boost::serialization). These packages solve countless problems encountered when serializing arbitrary data: handling collections, handling polymorphism, handling object graph joins or cycles, and even the low-level issues of binary serialization, such as endian-independent memory layouts. There is, as such, no reason to use your own serialization method without first trying out these methods.

Ordinal serialization

These methods, however, have the issue of being as general as possible. As such, they often generate serialized data that is too large when compared to the actual data being stored. This is of course perfectly normal: without prior knowledge of the data set, an algorithm cannot be optimized for either size or speed. For instance, if the programmer knows that the object graph is a tree (no joins or cycles), the serializer will still have to place additional data in the stream to tell the deserializer that no joins or cycles were encountered.

A such example of overhead applies to indexes. In general, indexes are numbers between zero and the size of a container, and they are used quite a fair lot in data structures. So, if the container has a size of 172, all indices fit in a single 8-bit byte, yet will be represented by most binary serialization systems as a four-byte or eight-byte value in memory, leaving it to a compression algorithm down the pipeline to compress the three unnecessary “zero” bytes away.

If ordinals are assumed to be close to zero with a higher probability, then a possible optimization is to pack as many ordinals in a single byte, then in the following byte, and so on. The memory representation thus uses, for each byte, 7 bits to represent the value and 1 bit to determine whether this is the last byte of the ordinal or if another byte exists. This method also has the advantage of eliminating any endianness issue, since it uses individual bytes.

The code

The piece of code below provides two functions for writing and reading an ordinal (represented as an unsigned number) from an iterator. The iterator’s value_type should be unsigned char (or be compatible with it). Possibilities would be an std::back_insert_iterator< std::vector<unsigned char> > or an std::ostream_iterator< unsigned char > over the appropriate output stream.

namespace ordinal
{
    template <typename It>
    void put(unsigned value, It &target)
    {
        while (value > 127)
        {
            *target++ = static_cast<unsigned char>(value % 128) + 128;
            value /= 128;
        } 

      *target++ = static_cast<unsigned char>(value);
    } 

    template <typename It>
    unsigned get(It &source)
    {
        unsigned char read;
        unsigned multiplier = 1;
        unsigned result = 0;
        while ((read = *source++) > 127)
        {
            result += (static_cast<unsigned>(read) - 128) * multiplier;
            multiplier *= 128;
        } 

        result += static_cast<unsigned>(read) * multiplier; 

        return result;
    }
}

This is a fairly common trick, encountered in countless file format specifications (take a peek at some of the binary formats for 3D models at wotsit and see how they represent endian-independent indices). The code above is hereby placed in the public domain, where allowed by law.

PHP Closures

A closure, as explained in the corresponding wikipedia article, is a computer language feature that allows a function defined within another function to use all the variables in the function that contains it. The typical usage of a closure is to create a lambda (anonymous) function that is used by an algorithmic function over a collection. In other words, this is what would be nice (but impossible right now):

$sum = 0;
array_walk($array, function ($x) { $sum += $x; });

While this may not seem as something very interesting in this specific case (after all, the version based on for_each is equivalent), it does get more interesting once you start to work on more complex data types that are not fully described by arrays or hash maps. Consider a form class with automatic validation of its fields:

$db = connect_to_database();
$labels = get_i18n_resource();
$form -> AddField('mail', 'input', function ($x) {
    if (mail_exists($db, $x)) return i18n($labels, 'invalid mail');
    return null;
});

Doesn’t PHP provide closures?

No, it does not. However, some features do come close. The most obvious is the create_function function from the standard library, which is available since PHP 4. It allows the creation of an anonymous function with the provided arguments and code:

$new = array_map($old, create_function('$x', 'return $x * 2;'));

However, the created function does not capture its environment. While it’s possible to access a global variable, or even to hardcode a constant into the body of the function, referencing local variables of the creator is impossible.

It should be noted that several workarounds have already been provided by clever authors. This one provides a way to extract the environment variables directly into a piece of code (it does not, however, include the possibility to pass arguments). This one uses a trick based on replacing positional parameters of the form $_1, $_2 … $_n.

Last but not least, PHP 5.3 has true closure support, although the local context variables have to be imported into the closure manually (more information here).

The solution

Since I don’t expect the excellent solution from PHP 5.3 to be widely supported for another while (my current hosting doesn’t even go beyond PHP 4, for what it’s worth), I want to propose my own solution which mixes the idiomatic syntax of create_function with the principle of importing values from the context manually. Another constraint is that the code should be clean and small. The expected usage is:

$sum = 0;
array_walk($array, create_closure('$x', '$sum += $x;', array( 'sum' => &$sum)));

That is, the function behaves like create_function, but takes a third argument that binds variable names in the code expressed as a second argument. The variables are bound to the provided values by reference, allowing modification of the context in addition to simple reading.

The implementation I use involves a global variable (which can be accessed from PHP4 anonymous lambdas) to provide an anonymous function with the context (by means of a prelude that loads the data from the globally available context into the appropriate variable names). The code is as follows:

function create_closure($args, $code, $context)
{
  global $create_closure_data; 

  $nth = count($create_closure_data);
  $create_closure_data[] = & $context; 

  $prefix = 'global $create_closure_data;';
  foreach ($context as $var => $ignored)
    $prefix .= sprintf('$%s = & $create_closure_data[%d][\'%s\'];',
                       $var, $nth, $var); 

  return create_function($args, $prefix . $code);
}

Additional checks (mostly, testing each variable name to check that it’s a valid variable name) are fairly easy to add if required. This code is also hereby placed in the public domain, where allowed by law.

Immutable Sparse Grid

Sparse grids

Consider a bidimensional grid filled with values. The most optimal way of storing such a grid in memory is to use an array: a large span of contiguous memory big enough to contain all the values one after another. Accessing the value at coordinates (x,y) is done by accessing the index (x * height + y) of that array. This is how C and C++ handle multi-dimensional arrays (either the native static ones or those provided boost::multi_array).

Consider now a mostly empty bidimensional grid: one where less than 10% of the values exist (the others being filled with NULL, None or whatever does-not-exist option value you choose to use). By storing such a grid as a contiguous array, more than 90% of the allocated space would be wasted for nothing. This situation calls for a different data structure: a sparse grid. Sparse data structures are implemented in such a way that they use less memory when few values exist in a grid, at the cost of a small computing overhead and a large memory overhead when the grid is nearly full.

This article will provide a simple sparse grid implementation based on the concept behind kd-trees.

And now, a word from our sponsor

The chosen implementation is immutable: this means that once a grid is constructed, it cannot be modified. What is possible, however, is to create a new grid from the old one, with slight differences (such as one more element or one less element). Immutable data structures are to functional programming what mutable data structures are to imperative programming. Where an imperative-style program would write:

 data[i] = new_data_value;

A functional-style program would write:

let new_data = set old_data i new_data_value

Therefore, a functional-style program gets the new data structure while also keeping the old version. This has very clean applications to exception safety and transactional operations, since rolling back is as simple as throwing away the new value. Plus, it gets all the advantages of referential transparency.

But doesn’t all this copying around create problems both in terms of memory usage and in terms of runtime performance? Not really: as will be obvious in this article, the overhead related to functional manipulations on tree and list structures is minimal. In terms of Objective Caml code, the interface of our grid will be the following:


type 'a grid
val empty : int -> int -> 'a grid
val size : 'a grid -> int
val get : 'a grid -> int -> int -> 'a
val set : 'a grid -> int -> int -> 'a -> 'a grid
val unset : 'a grid -> int -> int -> 'a grid

The empty function creates an empty grid of predetermined dimensions. The size function returns the number of elements in a grid, and get returns the element found at a given position in a grid (or raises Not_found if not set). The set function returns a copy of the grid containing a certain element at a certain position, while the unset function returns a copy of the grid with no element at that position.

The algorithm

Before moving on, I strongly suggest that you get yourself familiar with the notion of kd-trees (for instance, from this wikipedia article on the subject). The key idea is that every node of a kd-tree splits the universe in two halves, with each half being represented by one of the two sub-trees.

Given an empty grid, we choose the virtual tree structure as follows: one splits the grid in two halves of equal size across its longest dimension, then recursively splits the two halves, until the obtained grid contains exactly one tile. However, this is a virtual structure: representing it in memory would be wasteful, since most of the leaves will be empty anyway. Therefore, the program will only display the parts of the tree that are important: the parts of the tree that contain no elements in their leaves are simply represented by a ‘hole’ value, instead of containing the full set of nodes and the empty leaves.

Preliminary definitions

To achieve this, it’s first necessary to define a few preliminary types and functions. This starts with the tree type, which describes the various slicing steps to reach a certain grid element.

type 'a step =
  | Split   of 'a step * 'a step
  | Element of 'a
  | Hole

This tree contains no position information: indeed, that information is already available as part of the virtual tree (which was not constructed, but which can be computed on the fly from the width and height of the grid). Therefore, traversing this tree looking for a position will involve passing the width and height around. To factor out all the traversal code can be done with a function :

let seek ~ifl ~ifr ~ifh ~ife (w,h) (x,y) step =
  let aux (w,h) (x,y) = function
    | Split (l,r) ->
        let t = w/2 in
        if x < t then ifl (t,h) (x,y) l r
        else ifr (w-t,h) (x-t,y) r l
    | Element e   ->
        ife e
    | Hole        ->
        Lazy.force ifh
  in if w >= h
  then aux (w,h) (x,y) step
  else aux (h,w) (y,x) step

This function looks at a node in the tree and returns a certain result based on what type of node it is. In the case of an existing node with a left son and a right son, it compares the sought after position with the coordinates of the separator between the children, and then calls the expected function (ifl for a left child, ifr for a right child) with the dimensions of the child grid and the coordinates within that grid. In the case of a leaf node, it calls the expected function (ife for an element) on the element itself. In the case of a grid node that’s missing (because there are no populated children), it returns the expected value (ifh for a hole). The latter is a lazily evaluated value, because one might wish to raise an exception (or otherwise avoid to evaluate it) only if it’s actually used.

The clever part of this function is the x-y swap transform. Indeed, the tree remains the same if you swap the x and y coordinates (as well as the corresponding dimensions of the grid). Since splitting always occurs along the longest edge of the grid, swapping the values to ensure that the width is always larger than the height allows the factoring of the splitting code.

Using this helper function, it’s possible to write three specific helper functions for getting, setting and unsetting:

let rec get_aux dim pos step =
  seek
    ~ifl:(fun dim pos step _ -> get_aux dim pos step)
    ~ifr:(fun dim pos step _ -> get_aux dim pos step)
    ~ifh:(lazy (raise Not_found))
    ~ife:(fun x -> x)
    dim pos step 

let rec set_aux dim pos step x =
  seek
    ~ifl:(fun dim pos l r -> Split(set_aux dim pos l x, r))
    ~ifr:(fun dim pos r l -> Split(l, set_aux dim pos r x))
    ~ifh:(lazy (
      if dim = (1,1) then Element x
      else set_aux dim pos (Split(Hole,Hole)) x))
    ~ife:(fun _ -> Element x)
    dim pos step 

let rec unset_aux dim pos step =
  let clean = function
      Split (Hole,Hole) -> Hole | other -> other
  in
  seek
    ~ifl:(fun dim pos l r -> clean (Split (unset_aux dim pos l, r)))
    ~ifr:(fun dim pos r l -> clean (Split (l, unset_aux dim pos r)))
    ~ifh:(lazy (Hole))
    ~ife:(fun _ -> Hole)
    dim pos step

The get algorithm is fairly straightforward: always go into the side which contains the sought coordinate, raising an exception if it’s a hole and returning the value if an element is found. If inlined, this function also happens to be tail-recursive.

The set algorithm digs into the tree looking for the coordinate it’s trying to set. If it encounters a hole, either the hole is small enough (and it is then filled with an element) or it’s too large and the algorithm creates a new split node at that position and digs further.

The unset algorithm searches for the correct element and, once found, replaces it with a hole. It then moves back up the tree, replacing split nodes with two hole children with a single hole node, thereby removing unused areas of the grid from the tree.

Glue code

Once the helper functions are implemented, only a little glue is required to implement the module signature:

type 'a grid = { d : int * int ; s : 'a step } 

let test g x y =
  let w,h = g.d in
  if x < 0 || x >= w || y < 0 || y >= h then
    invalid_arg "Out of bounds" 

let empty   x y   = { d = x, y ; s = Hole }
let size  g       = size_aux 0 g.s
let get   g x y   = test g x y ; get_aux g.d (x,y) g.s
let set   g x y v = test g x y ; { g with s = set_aux g.d (x,y) g.s v }
let unset g x y   = test g x y ; { g with s = unset_aux g.d (x,y) g.s }

A test function ensures that the requested coordinates are valid, and the functions merely forward their arguments to the appropriate helper function.

The end result

Let us look at overhead versus a plain array. First, assuming that the grid is not sparse (all values are provided), then the resulting tree would contain the entire set of values, plus that many leaves, plus that many internal nodes, which weight much heavier

Of course, if the system is sparse, then the reverse is true. With zero elements, there is almost no memory usage. With a single element, the amount of memory used is logarithmic in the area of the grid, and this increases sub-linearly with the number of stored elements. This means that, if storing 100 elements in a 100×100 grid, the used memory would be below 1300 nodes worth of memory (and increasingly lower as the elements were clustered together), as opposed to the 9900 useless grid locations in a non-sparse implementation.

What about modifications? Again, adding or removing an element in a 100×100 grid results in at most 13 new nodes being created (since all the other nodes are not modified, and can thus be reused as-is from the previous grid).

How does this compare to the core Map module of Objective Caml ? It depends. On the one hand, the depth required is a function of the grid area, instead of the number of elements, making the sparse grid marginally more memory-hungry for very small amounts of values (but those would be better stored as an associative list anyway). On the other hand, the sparse grid is always balanced (never requiring the need for balancing like Map does) and does not need to store the keys, since those are made implicit by the slicing. Finally, the sparse grid tends to be adapted to clustered situations which reduce the effective depth of branches, whereas a Map is adapted when the grid elements used are uniformly spread out.

Options and References

The problem

Option semantics and reseatable reference semantics are orthogonal. C++ does not implement all combinations:

Reseatable Not reseatable
Optional The reference can be changed to something else, including nothing.

C++ : T* or boost::optional<T&>
Objective Caml : ‘a option ref

The reference cannot be changed, but it may reference nothing.

C++ : T* const or const boost::optional<T&>
Objective Caml : ‘a option

Not Optional
The reference can be changed to something else, but not nothing.

C++ : not implemented.
Objective Caml : ‘a ref

The reference cannot change, and always references something.

C++ : T&
Objective Caml : ‘a

A common issue in C family languages is that option semantics are usually intermingled with reseatable reference semantics (be it pointers in C and C++, or typical references in Java and C#).

The solution

C++ provides a simple compile-time way of ensuring that a given referenced object exists: reference types have no option semantics and can therefore be used to guarantee that the manipulated value is fixed. Therefore, it’s possible to propose an optional type based on pointers, which verifies that the manipulated object is never null.

template <typename T> class safe_ptr
{
    T* data;
public:
    explicit safe_ptr(T& i) : data(&i) {}
    safe_ptr &operator =(T& i) { data = i; } 

    // Default copy and assignment semantics are correct 

    T& operator *() const { return *data; }
    T* operator->() const { return  data; } 

    bool operator == (const safe_ptr& other) const { return data == other.data; }
    bool operator != (const safe_ptr& other) const { return data != other.data; }
};

Example usage:

std::string a = "A", b = "B"; 

safe_ptr<std::string> p = a;
std::cout << *p; // Outputs 'A' 

p = b;
std::cout << *p; // Outputs 'B'

The issues

This solution is not perfect. First, it fails to interact correctly with inheritance, so a safe pointer to a derived class cannot be implicitly converted to a safe pointer to a base class, and the conversion has to be done manually and explicitly using safe_ptr<Base>( *ptr_to_derived ). Second, it also fails to provide iterator semantics, although that would be expected of a mutable pointer type. And third, it relies on the user’s ability not to dereference null pointers to pass them to the class’ constructor.