Archive for the 'Functional' Category

OCaml Submodule Pattern

My code is quite large for an OCaml project. The main RunOrg repository alone contains 46212 lines of OCaml code (plus an additional 5631 lines of OCaml mli files) — and then, there’s the web framework code and the independent plugins code.

It’s is Better™ to have many short files than a few long ones. One reason is incremental compiling with ocamlbuild : that the smaller your files are, the smaller the percentage of code to be compiled when you make a small change. Another reason is that files provide a natural delineation of code that makes it slightly easier to reason about.

The very process of splitting a large file into smaller files is also an excellent way to clean up the code. Every split is an opportunity to move some code to a more generic location — why have a CMember_importParser module when all of its functionality could fit into an OzCsv plugin module ? Even when no such generic solution exists, cutting through the jungle that a 2000-line module contains helps clean up dependencies, identify shared functionality and imagine better ways to design code.

Still, when cutting up code this way, the problem of encapsulation remains. If code that relates to pictures (an upload module, a transform module, a download module, an access rights module) is split across several files, it is desirable to let each file access functions and values from other values that would not otherwise be shown to modules not related to picture processing. For instance, a get_download_link function should be available throughout all picture-related modules, but the rest of the application should use the get_download_link_for_user function that checks whether the user is allowed to download the file.

In order to achieve several nested levels of encapsulation required to work with modules this way, I have come up with a convention :

  • A module name (and thus, a file name) is composed of segments written in camelCase and separated by underscores. For instance, CEntity_view_grid is a module name containing segments CEntity, view and grid.
  • Modules with only one segment are public. Any other module may include, open or otherwise reference them with no limitations beyond what the module signature says. So, CEntity may access MGroup freely.
  • Modules with N > 1 segments are private. They may only be accessed by modules which share the first N-1 segments. So, CEntity_view is available to modules CEntity and CEntity_edit but not CPicture.
  • A module with N segments may export any module with N+1 segments it can access, possibly under a more restrictive signature. For instance, CEntity_view is available to all other modules as CEntity.View.

To make these rules easier to respect, private module dependencies are made explicit by adding a list of module aliases at the top of each file. The top of my cEntity_view.ml file starts with :

module Sidebar     = CEntity_sidebar
module Unavailable = CEntity_unavailable
module Edit        = CEntity_edit
module Info        = CEntity_view_info
module Directory   = CEntity_view_directory
module Grid        = CEntity_view_grid
module Wall        = CEntity_view_wall

It is forbidden to use a private module without going through such an alias, and it is forbidden to define such an alias anywhere except at the top of the file. This makes it extremely easy to determine whether private access rules are respected.

The rule of thumb for splitting files (in my particular coding style) is :

  • Code for separate layers (model, view, controller…) go into separate public modules.
  • For complex code (such as complex rules in model or controller code), consider splitting files larger than 200 lines.
  • For simple code (such as HTML template or JSON serialization definitions), there is no splitting limit except for factoring out common behavior.

Two-way bindings

Quick : how would you design a function that opens a file handle that auto-closes whenever execution leaves a certain scope, even if an exception happens ?

The C++ solution is quite straightforward : have a destructor that closes the file handle, and create the handle as an auto variable:

std::ofstream out = std::ofstream("out.txt");
out << whatever();
// file handle is always closed by the language

On the other hand, OCaml does not have destructors, but can simulate the RAII spirit using a closure: you provide a function that will be called with the file handle as its argument, and the file handle will be destroyed when the function returns or raises an exception.

BatFile.with_file_out "out.txt" begin fun out ->
  BatIO.nwrite out (whatever ())
  (* the file handle is always closed by the language *)
end

This is a special case of a more general principle.

Two-way binding

The standard let keyword performs a one-way binding: bind value to variable then evaluate expression. Two-way binding adds a post-processing step : when you’re done with the expression, do something else. Such a behavior has important consequences for writing concise and readable code.

In my OCaml code, two-way binding is performed with keyword let! that is preprocessed as follows :

let! pattern = value in expression
(* Is translated to *)
value (fun pattern -> expression)

For instance, the above file manipulation script would be written as:

let! out = BatFile.with_file_out "out.txt" in
BatIo.nwrite out (whatever ())

This syntax expresses the actual intent of the code better than the anonymous callback syntax did: bind the file handle to this variable, but don’t forget the post-processing steps.

Here are a few more examples of situations that may be improved by this syntax :

Events and reactive programming

Reactive programs can be constructed either using the typical “register this function to be called whenever this value changes or this event happens” semantics, or  by using binding semantics instead:

let () =
  let! user = User.on_change (#last_login) in
  if user # notify_login then
    Mail.send (user # email)
      ("Someone has logged in to your account at " ^ datetime (user # last_login))

The underlying signature of User.on_change (which registers a listener callback and returns unit) remains the same.

Retry semantics

CouchDB implements transactions with retry semantics: you read a document, compute some changes and try saving them back, and if the document was changed by someone else in the mean time, you will have to try again. It makes sense for the code inside the transaction to be 1° idempotent and 2° wrapped away in a function that 3° takes the latest version of the document as an argument :

let set_title article_id new_title =
  let! article = Database.transaction article_id in
  Database.write article_id { article with title = new_title }

In such a design, the write function would throw a specific exception if a collision occurs, and the transaction function would intercept that exception and try again until the transaction succeeded or a maximum number of retries happened.

Monads

Value binding in monads benefits from having a syntax that actually looks like binding.With the option monad, one can turn this :

match Files.get file_id with None -> None | Some file ->
  match file # owner with None -> None | Some user_id ->
    match Users.get user_id with None -> None | Some user ->
      Some (user # name)

Into a more straightforward version :

let  open BatOption.Monad in
let! file    = bind $ Files.get file_id in
let! user_id = bind $ file # owner in
let! user    = bind $ Users.get user_id in
return (user # name)

Also, one can deal with Lwt threads almost as well as the Lwt-specific syntax extension:

open Lwt
open Lwt_io

let process_lines channel process =
  let loop () =
    let! line_opt = bind $ read_line_opt channel in
    match line_opt with
      | None -> return ()
      | Some line -> loop () <&> process line
  in
  loop ()

Being Silly

let fold init list f = List.fold_left (fun acc x -> f (acc,x)) init list
let map list f       = List.map f list

let probabilities odds =
  let sum =
    let! accumulator, odd = fold 0. odds in
    accumulator +. float_of_int odd
  in
  let! odd = map odds in
  float_of_int odd /. sum

The Syntax Extension

In case you don’t know how to create it, this is the preprocessor file for this syntax extension :

open Camlp4.PreCast
open Syntax

EXTEND Gram
 GLOBAL: expr;

 expr: LEVEL "top"
 [
   [ "let"; "!"; p = patt ; "=" ; e = expr ; "in" ; e' = expr ->
     <:expr< (($e$) (fun $p$ -> $e'$)) >> ]
 ] ;

END;

By the way, I find that this extension has a significant advantage over the Lwt extension – it is readily compatible with syntax highlighting in most editors.

Basic Patterns for Everyday Programming

Lakshen Perera provides a list of basic patterns for everyday programming, illustrated in Javascript and Ruby. I thought it would be interesting to provide an OCaml illustration as well, and perhaps a handful of additional patterns as well.

Verify object’s availability before calling its methods or properties

In many languages, there is a possibility for objects to be missing — whether this is represented as NULL, null, nil or None. Regardless of the language, it is important to keep in mind at all times whether a given value is optional or not. If it is not optional, then you may assert so using a type declaration if your language supports it :

method name  : string (* not optional *)
method email : string option (* optional *)

In other languages, a runtime assert will do, for instance in PHP :

assert (isset($this->name))

OCaml will not require you to explicitly declare all values as optional or mandatory. Instead, it will deduce that information from the way you use each value. For instance, since json_of_string expects a non-optional string argument, you can simply write :

let parsed_content = json_of_string json (* json is not optional *)

If json is an optional string, this will create a compilation error unless you explicitly define what should happen when the string is missing. The most common approach is to decide that the result should be missing too :

let parsed_content = match json with
  | None   -> None
  | Some s -> Some (json_of_string s) (* by definition, s is not optional *)

If using the Batteries library (which, by the way, you should), you can express this more easily :

let parsed_content = BatOption.map json_of_string json

Set a default value with assignments

This advice is almost identical to the previous one : if you need to construct a non-optional value, but only have access to an optional value, you will have to provide a default value. Using standard OCaml :

let role = match person.role with
  | None      -> `Guest
  | Some role -> role

Using Batteries, there is an almost equivalent version :

let role = BatOption.default `Guest person.role

This version is almost equivalent, because it will evaluate the default value even if it is not required. Let’s assume that the default value is itself provided by a complex computation, such as a database access :

let role = match person.role with
  | None      -> readfrom database (* only executed if person.role is missing *)
  | Some role -> role 

let role = BatOption.default
  (readfrom database) (* Always executed, even if unnecessary *)
  role

The Batteries library provides an alternate function for that :

let role = BatOption.map_default
  readfrom database (* only executed if person.role is missing *)
  role

Checking whether a variable equals to any of the given values

If the values have a legitimate reason to be strings, integers or other types with unlimited numbers of values, then « is in list » predicates are the preferred choice :

if List.mem current_day ["Monday";"Wednesday";"Friday"] then
  (* do something *) 

Days are a quite bad example, because these are better represented as a variant (which then type-checks whether you have written Mornday instead of Monday). If you have already defined the type of weekdays, then :

if List.mem (current_day : weekday) [`Monday;`Wednesday;`Friday] then
  (* do something *)

If this is a one-shot check, or if you would rather not define a weekday type yet, you should instead go for an exhaustive pattern-matching :

 match current_day with
  | `Monday | `Wednesday | `Friday ->
    (* do something *)
  | `Tuesday | `Thursday | `Saturday | `Sunday -> ()

Extract complex or repeated logic into functions

This is a fairly fundamental concept — but it consists in two distinct parts. There is a separation and naming part (you pull out a piece of code and give it a name, which helps understand what it does and how it relates to the rest of the program) and there is an extraction and reuse part (the piece of code is pulled out into a more globally accessible location and parametrized, so that it may be used in other places).

With the above example, simple separation-and-naming would be :

let is_discount_day = match current_day with
 | `Monday | `Wednesday | `Friday -> current_date > 20
 | `Tuesday | `Thursday | `Saturday | `Sunday -> false
in

if is_discount_day then 
  (* do something *)

The variable is defined in the same scope it is use in, and it assumes that current_day and current_date values have been defined previously in that scope. Extraction-and-reuse would go further :

type weekday =
  [ `Monday | `Tuesday | `Wednesday
  | `Thursday | `Friday | `Saturday | `Sunday ]

let is_discount_day (day:weekday) date =
  List.mem day [`Monday;`Wednesday;`Friday] && date > 20

...

  if is_discount_day current_day current_date then 
    (* do something *)

Now, is-discount-day is a global function available from everywhere in the code, and it uses the provided parameters to determine whether this is indeed a discount day.

Memoize the results of repeated function calls

OCaml has several ways to perform memoization. One of them is lazy evaluation :

val discount_day = lazy (is_discount_day current_day current_date)
method discount_day = Lazy.force discount_day

The lazy expression will only be evaluated the first time the Lazy.force function is called on it.

Note that if the current day or current date can change, then the memoization actually breaks things !

Memoization is also helpful when dealing with a function that requires arguments, in which case a different result will be provided for each argument set. A common solution is to use a hash table to store these :

let fibonacci =
  let memo = Hashtbl.create 100 in
  let rec fib n =
    try Hashtbl.find memo n with Not_found ->
      let result = fib (n-1) + fib (n-2) in
      Hashtbl.add memo n result ; result
  in fib

This works fine for short-lived functions — don’t do this for global functions that might stick around for a long time, because the memoization hash table will grow and its contents will never be garbage-collected. If you really have to, use a weak hash table, such as Batteries’ BatInnerWeaktbl, so that the garbage collector may reclaim the memoized values when it runs out of memory.

Also don’ t overdo memoization — it only works when arguments are reliably passed more than once and the time to compute the value is significantly larger than the time to retrieve and store it and it is worth the memory usage and the function has no side-effects.

Use the seven list manipulation primitives

Almost any processing on collections of items can be expressed in terms of seven fundamental patterns. Recognizing those patterns can help improve the clarity of both the code and the underlying algorithm.

1. Map transforms a list into another, item by item, in linear time. Use a map operation when all you need is a one-to-one transformation. The line below extracts three recipes from the database using their identifier.

let recipes = List.map from_database [ "omelet" ; "cheeseburger" ; "risotto" ]

2. Reduce transforms a list of values into a single value by repeatedly applying a function that combines together two values into one. The typical example is a fold, which uses a function to combine each list element, in turn, with an accumulator. It can be used to extract the sum of values in a list, for example :

let total = List.fold_left (+) 0 [ 5 ; 6 ; 3 ; 8 ; 9 ; 0 ; 7 ; 6 ]

This transform allows a preliminary map step which transform the values inside the list into values that can be combined. For instance, to find the age of the oldest person in a list of people :

let oldest = List.fold_left (fun acc person -> max age person.age) 0 people

3. Extract works like a map, but the transformation function returns zero, one or more results for each call. All the results are included in the final list. The most elementary implementation is literally to have a map (that transforms a list into a list of lists) followed by a concatenation (that transforms a list of lists into a list). For instance, to get all the ingredients involved in a list of recipes :

let ingredients =
  List.concat (List.map (fun recipe -> recipe.ingredients) recipes)

4. Filter is a subset of Extract where the transform may not return more than one result — but it may still return none, so its result is simply an optional type. For instance, to extract the list of all recipes that have a wine recommandation along with their recommended wine :

let wines = BatList.filter_map
  (fun recipe -> BatOption.map (fun wine -> recipe,wine) recipe.wine) recipes

The OCaml language also provides a standard List.filter function which keeps values for which a property is true. For instance, to get the list of recipes that have a wine recommendation :

let have_wines = List.filter (fun recipe -> recipe.wine <> None) recipes

This approach is weaker — the wines in the resulting list are still treated as optional by the type system, so you will need bogus pattern matching for a case that never happens (no wine) to extract the actual wine values. A filter_map lets you encode the filter property in the type of the result, which makes using the filtered list easier.

5. Sort unsurprisingly sorts the list. The canonical sort — using the canonical order relationship — is a theoretical curiosity, and in practice most sorts use a projection function p such that A < B iff p(A) < p(B). This is best illustrated in SQL, in the form of the ORDER BY <projection> statement. Two useful helper functions :

let project compare p a b = compare (p a) (p b)
let descending compare a b = compare b a

For instance, to sort the list of recipes based on how long each of them takes :

let by_duration =
  List.sort (project compare (fun recipe -> recipe.duration)) recipes

6. Group works like Sort, but further regroups « equal » items together by returning a list of lists. It works using a comparison function that is usually based on a projection function. For instance, to get three lists containing one-star, two-star and three-star recipes :

let by_stars =
  BatList.group (project compare (fun recipe -> recipe.stars)) recipes

There is a special case when the projection function returns booleans, which is known as a partition. For instance, to extract recipes that are desserts and recipes that are not :

let desserts, non_desserts =
  BatList.partition (fun recipe -> recipe.is_dessert) recipes

7. Search extracts one element from the list (if possible) based on a certain condition or property. Elementary searches are « first element » and « last element. » More complex searches : finding a value by key in a list of key-value pairs using List.assoc or using a predicate using List.find. The heavy-duty search tool is BatList.find_map, used below to find a recipe that is recommended by at least one person, and the recommending person :

let recipe, recommender = BatList.find_map
  (fun recipe -> BatOption.map (fun person -> recipe, person) recipe.recommended)
  recipes

These seven patterns can be used in conjunction to perform almost any algorithm on collections, sequences or lists. For a more complex example, assume we need the ten ingredients that appear the most often in desserts. We would filter recipes by desserts, extract their ingredients, group them by name, sort the sub-lists by list length and take the name of the first element of the first ten sub-lists :

open BatPervasives (* for operator |> *)
open BatList

let ten_best_ingredients recipes = recipes
  |> filter (fun r -> r.dessert)
  |> map (fun r -> r.ingredients) |> concat
  |> group (project compare (fun i -> i.name))
  |> sort (descending (project compare length))
  |> take 10
  |> map hd
  |> map (fun i -> i.name)

Article © brewbooks — Flickr

Comment Branches

Your development job is making changes in your software. Writing, testing and debugging those changes takes some time.

If your job is anywhere as hectic as mine, you will have to fix and deploy urgent patches, even when your application code is in a half-written, half-debugged state because of the feature of the month.

This is what branches are for. You keep two versions of the code, one of which is called the trunk and is always ready for deployment, and another which holds the changes that you are working on.

When your feature is done, you merge the two versions together. You want to keep the merge operation painless. To do so, you have several kinds of branches available.

The repository branch is built into your SourceSafe/subversion/git/whatever. It creates two independent copies, and you need to migrate changes from the trunk to every branch out there as soon as possible, or the merge will make you wish for a sweet and merciful death.

By the way, changeset-oriented tools (like git or mercurial) make this easer, while revision-oriented tools (like subversion) make it harder.

The feature branch is done using programming logic. The code you deploy to production supports the new feature, but it is turned off for everyone except yourself. This technique is great for adding features, but inefficient when changing existing ones.

A side effect of the feature branch is that you can stress-test new code by rolling it out to increasing numbers of users progressively.

The comment branch is an odd gambit. It involves ripping out an entire module and replacing it with another that has a different interface. This will involve large amounts of re-wiring all over the code base, and these will take hours or days before they can be compiled, let alone tested.

Use a comment structure such as this one:

/*[*/ old code /*|* new code *]*/

It is trivial to build a text-replacement macro that turns the above into the code below and back:

/*[* old code *|*/ new code /*]*/

Use the macro to switch between development mode (when you write new code and desperately try to get it to compile) and fix mode (when you edit the old code and deploy it). For consistency, always commit the old version to the repository.

Why use comment branches instead of repository branches ? Maybe your source control tool sucks at branches. I use Subversion. Yes, I know. Legacy, pain and unlikely hopes of a brighter future.

When a trunk change occurs in a part that has been erased or reworked in the branch, that change will cause a conflict that will require manual intervention. Even with git or mercurial. For a large number of small changes sprinkled over a large codebase that is routinely involving many small updates, repository branches turn into a merge minefield.

Does your branch involve a small number of well-defined files ?

Then you should use repository branches, because conflicts will only happen in those files, and will usually be easy to fix.

Does your branch involve many changes in many files everywhere in the project ?

Then use comment branches.

Last and possibly least, there is the TODO-branch. This involves non-breaking, purely cosmetic changes. 25% of my project uses this syntax for historical reasons:

Table.get id |-> function
   | None       -> return 0
   | Some value -> return value.count

Then, a convention change happened, and this is used instead:

let! value_opt = breathe (Table.get id) in
match value_opt with  
   | None       -> return 0
   | Some value -> return value.count

Then, another convention change happened, and this should be used instead

let! value = breathe_req_or (return 0) (Table.get id) in
return value.count

And then, there’s the current version:

let! value = breathe_req_or (return 0) $ Table.get id in
return value.count

Whenever I change coding conventions, I do not spend the time to reformat the tens of thousands of lines of code in my application. That would have been wasteful. Instead, every time a piece of code is refactored, it is refactored to the most recent style.

The same happens when using an old and a new version of a given API. My code uses two libraries for handling HTML forms, uses both Javascript and Coffeescript, and a variety of similar two-hammers-one-nail situations.

These are, for all practical purposes, branches. They are work that is being performed for long durations. The benefit of TODO-branches is that code in the middle of such changes is still compatible with the trunk. It all happens in the head of the developer, who remembers what changes should be done the next time a piece of code is rewritten.

Article Image © Dominic Alves — Flickr

Annealing Constraints

Of all the existing constraint-solving algorithms out there, my favourite is simulated annealing. To use it, you need a mutation function that randomly and slightly alters a solution, and a fitness function that lets you compare two solutions. Based on these, the algorithm is simple :

  1. Start with an arbitrary solution.
  2. Mutate the solution : if the mutated solution is better than the current one in terms of fitness, keep it and repeat this step.
  3. If you discard too many mutated solutions in step 2, remember the current solution and mutate it.
  4. When time runs out, return the best solution found so far.

Step 2 is a fairly standard ascent towards a local maximum : try a random similar solution and keep it if it improves fitness. Step 3 detects that the algorithm is stuck in a local maximum, so it shakes the solution out of there and starts again from that new position.

What is interesting about this solution is that it is exceedingly easy to implement, easy to adapt to a problem (as long as mutation and fitness are possible), and is reasonably efficient for most small problems.

I used it today to solve an original problem on RunOrg. We have a history of recently connected users, and we wanted to display a simple metric along the lines of « 10 users connected in the last hour » or « 11 users connected this week » as a short summary on the administrators’ dashboard. The problem, of course, is to pick a reasonable duration based on the current data ! If one user connected seven days ago, and ten users connected a few minutes ago, then both « 10 users connected in the last hour » and « 11 users connected this week » are correct descriptions of the situation, but the former is intuitively better than the latter. Intuitively, we should use the shortest duration with the largest amount of users.

And intuition is a bad thing, because it is hard to turn into an algorithm.

In the end, I settled for a handful of periods of time : the last 5 minutes, the last 1 to 6 hours, today, the last 2 or 3 days, and this week. Each of these periods of time contains a certain number of user connections (which is fairly easy to determine based on the connection history) and receives a score computed by multiplying the number of user connections with a magic constant that is tied to that period. So, if the magic constant for « the last hour » is 0.76 and that for « this week » is 0.03, then 10 × 0.76 = 7.6 and 11 × 0.03 = 0.33 meaning that the « 10 users connected in the last hour » version will be preferred.

Then, I came up with several examples of connection histories, and manually picked the intuitive duration that should be chosen to display them. This turned into the fitness function : any set of magic numbers would have to agree with as many of those examples as possible, the more the better. Mutation, on the other hand, consisted in changing one of the magic constants at random.

Within 2000 iterations and a few seconds, the annealing algorithm came up with a solution that satisfied 94% of my constraints — and I am not even certain whether 100% could be satisfied, since I might have left a few contradictory intuitions in there. The final output is:

== Fitness: 94.44%
5min : 1
1h : 0.761091036722064
2h : 0.42295812451487896
3h : 0.26899204507851066
4h : 0.20923702855563558
5h : 0.14858146007043194
6h : 0.14846461447836096
today : 0.1477566411417444
2days : 0.07381054881381345
3days : 0.04202811963812667
week : 0.033268828300871064

And so, these are the magic numbers that our algorithm uses. If it makes wrong decisions, we can always run the annealing again with new constraints, and feed the magic numbers back in.

Article image © Detlef Schobert — Flickr

Ozone Templating

There have been recurring requests about an in-depth explanation of how Ozone — our in-house OCaml web framework — handles HTML templates. So, here it is.

A template is usually understood by everyone to be « HTML with holes » that is filled using values from the application itself. It is, in a sense, a DSL that is restricted to describing how HTML should be built.

Here is an example of template Ozone could use, stored in file users.htm :

<h1>{t:users.title}</h1>
<ul class="userlist">
  {{list:
    <li id="{v:id}">
      <img src="{v:img}"/>
      <a href="{v:url}">{v:name}</a>
    </li>
  }}
</ul>

<script type="text/coffeescript" params="save_url">
list = @$.find 'ul'

save = =>
  ids = [];
  list.children('li').each ->
    ids.push $(@).attr 'id'
  @ajax save_url, ids

list.children('li').sortable
  change: save
</script>

<style type="less">
.userlist {
  list-style-type: none;
  li img {
    float: left;
    margin-right: 5px;
    width: 50px;
    height: 50px;
  }
}
</style>  

The template for a sortable list of users contains three things:

  • A piece of HTML, which is the actual « HTML with holes » to be filled later. The holes are marked in orange.
  • A piece of CoffeeScript, which will be extracted from the template file, compiled to javascript and appended to a site-wide javascript file. It will be replaced, in the template, by a hole that will call the extracted javascript with additional parameters provided by the application (in orange).
  • A piece of LESS CSS, which is compiled to CSS and appended to a site-wide CSS file.

These are not sections — they can appear in any order as long as the elements and attributes are respected so the pre-build tool can identify and extract the CoffeeScript and CSS bits.

Let’s examine each of them in order.

The HTML Template

This is the meat of the template. In order to improve application performance, loading the templates is a multi-step operation that involves intermediary storage formats.

The first step consists in reading in all the necessary templates, parsing them to determine that no variables are undefined, and storing them as a JSON blog in the underlying CouchDB database. This is a manually triggered operation that happens whenever we modify the templates (it’s part of our deployment procedure). This step may also involve a bit of cleanup, such as removing semantically irrelevant spaces from the HTML (this cannot be done earlier, because some templates are plaintext instead of HTML, and only the application knows which is which).

The second step happens whenever a new instance of our application begins — maybe it died and needed to restart, maybe Apache decided it needed another worker process to handle a surplus of request, or maybe we added a new server to our web farm. The startup process of our application server does not read anything from the disk — instead, it will read in all the template data from the database, along with all the other bits of configuration: internationalization strings, third party API keys, feature branch triggers, and so on. Then, it will compile every template down to optimized closure-based opcodes for a hole-filling virtual machine.

The third step happens whenever a bit of HTML needs to be rendered. The application provides the hole-filling virtual machine with a data object and a « writing stream » which is either the HTTP request stream or a JSON serializer stream, depending on whether the request is normal HTTP or AJAX. This is an extremely fast operation where no parsing or checks are performed.

On the application side, loading a template involves three things:

  • Declaring the type of the data object expected by the template.
  • Declaring the source file from the template (as a function of the language).
  • Declaring the hole-to-value mapping  to be used.

Here’s that loading code for the above template file:

module User = Loader.Html(struct
  type t = <
    id   : Id.t ;
    url  : string ;
    img  : string option ;
    name : string
  > ;;
  let source  _ = "users/list"
  let mapping _ = [
    "id",   Mk.esc (fun x -> Id.to_string (x # id)) ;
    "url",  Mk.esc (fun x -> x # url) ;
    "img",  Mk.esc (fun x -> BatOption.default img404 (x # img)) ;
    "name", Mk.esc (fun x -> x # name)
  ]
end)

module UserList = Loader.Html(struct
  type t = <
    users : User.t list
  > ;;
  let source  _ = "users"
  let mapping lang = [
    "list", Mk.list (fun x -> x # users) (User.template lang)
  ]
end)

One view is defined and loaded for every independent piece of HTML in the template. Here, there is an User view which represents the list item for a single user, repeated zero, one or more times ; and there is the UserList view representing the wrapper in which those list items will be placed.

The {v:foobar} syntax defines a variable hole. The corresponding view MUST define a mapping for that variable, or an error will occur at deployment time.

The {{foobar: }} syntax is a variant: in addition to declaring a variable hole, it also defines such a sub-view, which can be loaded using template/foobar as the path.

The {t:foobar} syntax defines a translation hole. The template engine will automatically load the corresponding term from the internationalization dictionary used to render the template.

The Mk.esc and Mk.list are binding instructions which are used to compile the template to a virtual machine. The common binding instructions are:

  • Mk.esc f applies f to the data object, which returns a string. That string is then HTML-escaped and output.
  • Mk.str f is the same as above, but the string is not HTML-escaped.
  • Mk.i18n f is the same as above, but the string is translated as an internationalization term.
  • Mk.list f t applies f to the data object, which returns a list of data objects compatible with template t. That template is then used to render those data objects in order.
  • Mk.list_or f t e is the same as above, but if the returned list is empty, it instead uses template e to draw a « list is empty » message.
  • Mk.sub f t applies f to the data object, which returns a single object compatible with template t. That template is then used to render the object.
  • Mk.sub_or f t e is the same as above, but f returns an optional type. If it is missing, then template e is used to render an « object is missing » message.
  • Mk.text f provides f with the current writing stream and internationalization object, so that it may directly write HTML to the output. This is how most rendering helpers such as « render a currency amount » are used.
  • Mk.box f is the same as above, but the writing stream supports the addition of arbitrary javascript code to be executed by the client as part of rendering the template. This is how javascript-dependent rendering helpers such as « render a datepicker » are used.

The data type is defined in the view itself, either explicitly (as I did above for the sake of clarity) or by using an existing type from your application — if the application already had an user module with the appropriate data type, I could have used that type instead.

By specifying views in this way, the data required to render a template is made available to the compiler for type-checking, and missing bindings are detected during deployment (usually to a local test server). This has made template-related errors exceedingly rare — once the HTML is done, it becomes extremely hard to use it wrong.

Although this feature is not currently in use, the virtual machine semantics also allow compiling it down to JavaScript. This would allow us to send the rendering code to the client as a one-time cost, and send a much smaller data package through AJAX whenever something new needs to be rendered.

The CoffeeScript Layer

We use CoffeeScript because it’s more elegant, shorter, and includes a compiling-to-javascript step that lets us detect syntax errors at deployment time. Yes, compile- and deployment-time are my favorite buzzwords, because I enjoy the feeling of safety that they bring.

As mentioned above, the actual CoffeeScript is removed from the template in a pre-processing step, and replaced with a hole that says « call JavaScript function #33 now » that happens to define a list of parameters matching the params attribute of the original script element.

So, starting with the script element from the example above:

<script type="text/coffeescript" params="save_url">
list = @$.find 'ul'

save = =>
  ids = [];
  list.children('li').each ->
    ids.push $(@).attr 'id'
  @ajax save_url, ids

list.children('li').sortable
  change: save
</script>

If this is the 33rd script tag encountered by the preprocessor, then it would append the following to the complete CoffeeScript file:

@j33 = (save_url) ->
  list = @$.find 'ul'

  save = =>
    ids = [];
    list.children('li').each ->
      ids.push $(@).attr 'id'
    @ajax save_url, ids

  list.children('li').sortable
    change: save

And it would be replaced in the template file with this:

{j:j33:save_url}

This syntax (which can be used manually, although it should be avoided) is a javascript hole, it runs the specified function and provides by-name values for the arguments. The parser would notice that we are declaring an HTML view instead of a JS/HTML view and complain about it, so we would have to go back and re-define it:

module UserList = Loader.JsHtml(struct
  type t = <
    users : User.t list ;
    save_url : string
  > ;;
  let source  _ = "users"
  let mapping lang = [
    "list", Mk.list (fun x -> x # users) (User.template lang)
  ]
  let script  _ = [
    "save_url", (fun x -> Json_type.String (x # save_url))
  ]
end)

I have used Loader.JsHtml instead of Loader.Html, and defined a secondary mapping that is specific to JavaScript parameters, and which uses the data object to return JSON values.

How is the JavaScript called? Well, it really depends on how your JavaScript library handles it. On non-AJAX HTTP, Ozone will try to inject all JavaScript calls in a script element at the end of the HTML body. In AJAX mode, Ozone allows you render a template to a JSON object representing both the HTML and the JavaScript together, and it is the responsibility of the code that made the AJAX request to receive that object, place the HTML wherever applicable, and then “run the JavaScript”.

By convention, the JavaScript is called using a client context as itsthis value. The client context is an object which may contain whatever the caller finds interesting to place there, along with a variable named $ which should be a jQuery selection containing the root element of the previously rendered HTML. Hence, @$.find 'ul' would select the list in the rendered HTML, instead of all the lists on the page.

The LESS CSS Layer

This is the least interesting of all three layers. The LESS CSS code is extracted, appended to a single file, and compiled to CSS (which, again, is an useful deployment-time syntax check). The point of this feature is simply to let the designer place element-specific CSS next to the element, instead of having it exist in an external file and cause trouble with asset garbage collection (can I remove this rule or is it still used anywhere?) External files still exist, though, for CSS rules that are not limited to a single template.

Bonus : the triple hash

How do I define some code that should be called when a button is clicked? Defining it directly in the onclick method is ugly, hard to read and does not let the application provide parameters, so what else can I do?

The solution is to use an intermediary global object that happens to be the same for the entire file — a pattern that stores any template-related JS in a global variable named after __FILE__ !

Yes, it is a hack, but it’s a simple and useful one.

The only difference is that __FILE__ is spelled ###.

<button type="button" onclick="###.frobnicate()">{t:frobnicate}</button>

<script type="text/coffeescript" params="message">
###.frobnicate = ->
  alert message
</script

Article Image © gdbg12 — Flickr

Let’s Whine About Google Dart

What’s wrong about 17259 lines of code ? asks Andrea Giammarchi, following the publication of a gist that shows an 18-line Hello-World application written in Dart, the new Google-sponsored browser-based language, and the corresponding 17259 lines of JavaScript that the Dart source code would be compiled to.

Yes, 17259 lines of code is quite terrifying, and reading through the code does make one ponder what those engineers at Google were smoking and how they can ever expect such a language to overtake JavaScript — or even manage to be used in the first place.

And then, there’s a loud WHOOSH: the sound of a point flying so high over the collective heads of Dart critiques that it will probably collide with some garbage in low earth orbit.

The point is that this is version 0.01 of a language designed for optimization through static analysis.

I’ll be the first to admit that there is no way for a Hello-World application to execute all those 17259 lines. This is known as dead code: pieces of functionality which might be useful to some applications but not to others, which is included because it’s easier for the developer, but which will never run.

I can understand why Andrea is upset. Dead code is a big no-no in JavaScript, because it has to be downloaded and parsed by every user, which costs bandwidth and slows down everything. No, wait. Let me correct that last statement. Dead code is a big no-no in JavaScript because the JavaScript language makes it impossible to have automatic dead code elimination.

In JavaScript, you can have a global function named foobar which is not even once referenced in the entire code base, and it would still be unsafe to remove it because at some point the code fetches a funcname variable through AJAX and executes window[funcname]() — and it just so happens that the server returns "foobar" on the night of the full moon. In short, there’s no way in hell an automated tool can remove a single line of code from JavaScript without breaking at least one possible program. That’s just the way JavaScript is — objects are key-value containers with string keys.

Of course, you could decide to follow rules that prevent such things from happening. There’s at least one convention, for instance, which is used by minifiers to try and be a little more than glorified whitespace removers: variable renaming. If you have a local variable in JavaScript, you can rename it and all its occurences and the change will never be visible outside the scope it was defined in because the variable itself is not visible outside. This lets you rename aQuiteLongVariableName to g, which makes your code lighter. And it breaks because one of your function uses eval on an arbitrary string provided by the server and which references aQuiteLongVariableName. So, the convention is « do not use eval » but it’s still a convention, not a language specification, so it has to be opt-in and clearly document its requirements.

Dart was designed not to have these issues, so it does not matter if version 0.01 of the Dart-to-JavaScript compiler fails to perform dead code removal: it can be implemented, and Google engineers would indeed be fools not to implement it before the 1.0 release.

There are also questions to be asked about the actual contents of those 17259 lines — when they will be used, will it make sense?

One of the first things that I notice is, as Andrea mentioned, the various global binding functions such as:

function $bind1_0(fn, thisObj, scope) {
  return function() {
    return fn.call(thisObj, scope);
  }
}

This is actually an extremely important optimization, but not related to speed — it’s about memory. To understand why, look at this code:

var div = document.createElement("div");
var message = "Hello";
div.onclick = function(){
  alert (message)
};

The anonymous function here is a closure: it must carry the variable message with it in order to access it when it is called. In any sane language, the interpreter would store a reference to that variable in the closure, and be done with it. JavaScript cannot (again, because of eval) so it has to store the entire scope that the function was defined in, which includes both message and div. In some versions of IE, having such a circular reference between a DOM element and a JavaScript value creates a memory leak which leads to massive performance degradation, but even in better browsers than IE, the function will keep this div alive for longer than is really necessary, thus consuming memory for nothing.

Dart has the ability to perform this scope analysis and determine that only the message needs to be included. The truckload of bind functions help implement this by utterly eliminating the JavaScript closure:

function onclick(message) {
  alert(message);
}

var div = document.createElement("div");
var message = "Hello";
div.onclick = $bind1_0(onclick,div,message);

Another critic from Andrea was the massive amount of wrappers around elementary functions, such as setting an array element to a value or adding two values together, which should bring JavaScript-from-Dart programs to a screeching halt becuse of all the involved overhead.

They won’t. Trust me. This isn’t the first time I have seen such wrappers, and I know what they are and why they exist. They are generic adapters — fallback operations to be used when the static analysis step cannot determine precisely the types of the objects involved.

Quick. Translate the following code from Dart to JavaScript by hand in an optimal manner:

swap(a,i,j) {
  var t = a[i];
  a[i] = a[j];
  a[j] = t;
}

Chances are, you answered something like this:

function swap(a,i,j) {
  var t = a[i];
  a[i] = a[j];
  a[j] = t;
}

You would be wrong. You are under the mistaken impression that, because it uses the []-syntax, the Dart function works with arrays when in fact it works with any object that defines that access operator.

So, on your second attempt, you would answer something like this:

function swap(a,i,j) {
  var t = ugly_getter.apply(a,i);
  ugly_setter.apply(a,i,ugly_getter.apply(a,j));
  ugly_setter.apply(a,j,t);
}

And you would be wrong again. The key property of Dart (as well as other languages which encounter the same problem — OCaml, C#, C++, Java… anything with generic polymorphism actually) is that it allows static analysis. Static analysis lets you determine what the types of those variables might be. Is the swap function ever called on something other than an array? Then only define version one. Is it called on many things, but most often arrays? Define both functions, and use the specific array-compatible one whenever you are certain an array will be passed. Is it never called? Remove it…

This kind of analysis is reasonably simple to perform, and has the benefit of being able to create both highly-optimized code for primitive types and short generic code for abstract types. And you can even help the compiler by annotating types.

Again, it does not matter whether Dart 0.0.1 actually performs these optimizations. What matters is that they are both possible and easy given the language design, and I expect them to be present in release 1.0

I am by no means a Google zealot or secret Dart admirer. In fact, from what I have seen so far, I would sooner loathe Dart than admire it — a bastard type-checking that relies on warnings? A bland java-like syntax for brace weenies? No actual support for immutable data structures? Blech. Give me JavaScript CoffeeScript anyday. I’ll switch to Dart when it takes over the world.

But the current crapstorm is just ridiculous. Come on, the guys at Google are pushing out a 0.0.1 proof-of-concept and you’re complaining that their optimizer isn’t acceptable? Please. I’m not even sure there’s an optimizer yet. From the language specs, it’s fairly obvious that many optimization avenues unavailable to JavaScript will be available to Dart, the only question is whether release 1.0 will have them or not. Wait and see — criticizing the optimization of a 0.0.1 compiler is just silly.

Article Image © Erich Ferdinantd — Flickr

Functional Programming

Yesterday evening, I was present at Paris Hackers, a meetup for Hacker News enthusiasts — and a «hacker» in this context is not someone who breaks into software systems, but someone who is passionate beyond words about how software works. But the definition is vague, and that vagueness is part of my surprise.

We had a little chat on the topic of recent buzzwords — concepts that people were interested in, but not everyone understood. One of them was Functional Programming. The speaker asked who knew enough about it to explain to the others, and I naturally raised my hand: I was taught functional programming formally, I taught it myself, I implemented a functional programming language, and I am using functional programming professionally on a daily basis.

My surprise was that I was the only one. I had sort of assumed that a majority of «hackers» would know what functional programming is. I will have to revise that definition.

« Having Functions »

Functional programming itself has a definition problem: depending on whom you ask, you can get two different definitions on functional programming. One definition is what I would call pure functional programming, and the other is better summarized as « having functions »

Of course all languages have something they call «functions» but I strongly believe that there is a list of minimum requirements before you are allowed to say that.

  1. You should be able to define a FOOBAR anywhere. At global scope. Inside a class. Inside another function. Inside an expression. It can be an anonymous FOOBAR, or it can be given a name, depending on the situation.
  2. You should be able to store a FOOBAR in a variable, pass it as an argument to a function, and return it from a function. If your language is strongly typed, it should have an easily described type as well.

Replace FOOBAR with «integer» or «string» and you will notice that without these requirements one can hardly say a language «supports integers» or «supports strings» — what if you could not store integers in variables, or write string literals as part of expressions?

These requirements are called «having FOOBARs as first-class citizens» by the way. In C for instance, functions and arrays are second-class citizens, and integers are first-class citizens. In JavaScript, functions are a first-class citizen.

Allowing functions helps make a language a lot more expressive. Let’s think of a quick example: performing an HTTP request and displaying the result in a window. In pseudocode that looks shamefully like JavaScript code, it would look like this:

function display(url) {
  var page = http(url);
  return new Window(page);
}

This function has an obvious limitation: if it takes five seconds to load the data through HTTP, then the program will freeze for five seconds and only then display the window. A better solution would be to make the request asynchronous: a window in a «loading» state is returned immediately, and a background process takes care of the HTTP requests and fills the window with the contents once they have been received.

A clean way of encapsulating this would be to let the http function handle the asynchronous behavior. The calling code would only need to provide a description of what needs to be done once the contents have been received. There are two ways of doing this in an object-oriented world.

Object-oriented solution #1: inherit from an http request class.

class GetPageRequest extends HttpRequest {

  // We need to know what window to put the contents in
  function GetPage(window,url) {
    super(url);
    this.window = window;
  } 

  // Override this function to react to the data being received
  function onSuccess(page) {
    this.window.setContents(page);
  }
}

function display(url) {
  var window = new Window(loading);
  var request = new GetPageRequest(window,url);
  request.runAsynchronous();
  return window;
}

Object-oriented solution #2: implement an «HTTP success event handler» interface.

class OnReceivePage implements IHttpRequestSuccessHandler {

  // What window do we need to fill ?
  function OnReceivePage(window) {
    this.window = window;
  }

  // Handle success (as previous example)
  function onSuccess(page) {
    this.window.setContents(page);
  }
}

function display(url) {
  var window = new Window(loading);
  http(url, new OnReceivePage(window));
  return window;
}

Both solutions are quite long, because object-oriented solutions involve a certain amount of syntactic overhead. In both cases, the only code that is actually relevant to our work is the contents of the onSuccess function. What if the language allowed us to define that function directly, and passing it to the HTTP request without any syntactic fuss?

Functional solution #1: local function definition.

function display(url) {
  var window = new Window(loading);
  function onSuccess(page) { window.setContents(page); }
  http(url, onSuccess);
  return window;
}

In fact, why bother with giving the function a name? Its contents are obvious enough, so it would be shorter and cleaner to just passing it to the HTTP request without any name.

Functional solution #2: anonymous function definition, or «lambda»

function display(url) {
  var window = new Window(loading);
  http(url,function(page) { window.setContents(page); });
  return window;
}

The latter example is syntactically correct JavaScript code, by the way. As you can see, the code is both shorter and more readable than the class-based version. In general, using inheritance or interface implementation to define a single method is always significantly more wasteful that using an anonymous function.

There’s a third requiremend that I left unvoiced. Look back at the previous examples again. Notice how the class-based examples have to store the window object explicitly, but the functional versions use the variable directly? This is known as «closures» : when defining a function inside a function, the defined function actually carries with it all the variables that were present when it was defined. Functional languages always have closures, because any non-trivial operation you might wish to place in an anonymous function will involve accessing data that was present near the definition.

In summary: functional programming involves the ability to manipulate functions as you would manipulate any other data type (define anywhere, store, pass, return…) and allows writing more concise programs whenever you need to pass, store or return some behavior to be executed later.

Examples of functional languages:

  • JavaScript
  • Lisp (including Common Lisp, Scheme …)
  • ML (including SML, OCaml, F# …)
  • Haskell
  • Ruby
  • C#

Examples of near-functional languages:

  • PHP only supports lambdas and closures since 5.3, and closure syntax is suboptimal.
  • Python has 99% support, the only issue is that only one-line lambdas are allowed.
  • XSLT 2.0 has no anonymous functions, but is otherwise functional. I kid you not.

Examples of non-functional languages:

  • Java
  • C
  • C++

Pure Functional Programming

This is the other definition of «Functional Programming» that you can hear, and the confusion is understandable: languages that match definition #2 have long been the only examples of definition #1 as well.

The relevant concept is purity: can the impact of a function on the rest of the program be described only in terms of its return type, and can the impact of the rest of the program on the function be described only in terms of its arguments?

Consider the signature of an interface method:

public int frobnicate(Foo f);

There are many ways in which that function can affect the rest of the program:

  • It could call a global function to perform changes in the global state, such as altering a singleton, setting a global variable, printing data to the screen, sending data over a socket…
  • It could change member variables of the object it is called on.
  • It could call methods on its argument.
  • It could return a new value that is then used by the calling code.

There are also many ways in which the rest of the program could affect the behavior of the function:

  • It could depend on global variables to decide what it should do.
  • It could read global state (user input, files, sockets…)
  • It could access the data of the object it is called on.
  • It could access the data of its argument.

The entries marked in bold can be deduced from the method signature: having a return type means you want to return something, being non-static means you wish to read the state of the object you are called on, having an argument means you want to read data from that argument. The other entries are possible, but not readily apparent in the signature.

Pure functional programming is the principle of least surprise: sure, those entries that are not in bold are possible, but it would be surprising if they happened, so they are in fact not allowed. A pure function can only read data from its arguments and can only write data by returning it.

Also, pure functions imply that data is immutable — if functions are not allowed to change anything, then they are not allowed to change data structures, so the data structures are by definition immutable.

By the way, pure functional programming is all about what you cannot do — so you could in theory write functional programming in any language just by following coding conventions to that end. In practice, immutability is quite hard to respect if there is no syntax for making it easier — you spend a lot of time creating manual copies of objects in order to change a single field.

Working with pure functions and immutable data structures involves both benefits and trade-offs when compared to non-pure programing. Think of it as a slider which goes between «0% Pure» and «100% Pure» and you pick one value for your entire program. Many algorithms and concepts get easier to understand, change and manipulate as the slider moves towards the 100% end, but there are some concepts that are intrinsically non-pure and actually get harder and harder to think about as your slider moves away from the 0% end. For instance, working with a persistent data store is an intrinsically mutable endeavor, and it’s hard (but possible) to think about it in pure terms.

I find that a slider value of 95% in a language that makes such a value possible is the optimal place to be — the 5% includes writing code that is truly non-pure (such as database manipulation, responding to HTTP requests…) as well as code that is non-pure on the inside but is actually pure from the outside (such as memoization and caching).

Article Image © Tim Olson — Flickr

Using OCaml from CouchDB views

What follows is directly taken from my latest GitHub project, which provides an adapter for transforming OCaml applications into CouchDB view servers. The programmer writes an OCaml application that exports one or more map and reduce functions using the API found in module CouchAdapter, and creates a CouchDB design document that specifies the application path and the name of the exported functions. The adapter server then receives evaluation requests from CouchDB and passes them to the application, and returns the result back to CouchDB.

The objective of this project is not to support writing OCaml code directly into views! The OCaml code should follow the standard build procedure, the only exception being that the CouchAdapter API is used to export that code and make it available to the adapter server.

Requirements and setup

This adapter uses json-wheel for representing JSON values, and the build process requires OCamlBuild. There are no other direct dependencies. Building the adapter server runServer is fairly straightforward: make byte or make native generates runServer.byte or runServer.native respectively. Move the resulting application to an appropriate location on your system and allow CouchDB to execute it. My suggestion is:

cp runServer.native /usr/bin/couch-ml-adapter
chmod a+x /usr/bin/couch-ml-adapter

I will be assuming this convention for the rest of this manual. Once the server is built and installed, you need to configure CouchDB to actually use that adapter to execute OCaml views. Edit the local.ini configuration file of your CouchDB server (usually found in /etc/couchdb/local.ini) and add the following lines:

[query_servers]
ocaml=/usr/bin/couch-ml-adapter

Depending on your configuration, there might already be a [query_servers] section. If that is the case, add the second line to that section. If you have trouble configuring your query servers, read the CouchDB documentation.

Errors that happen while executing the adapter will appear in the CouchDB logs (usually found in /var/log/couchdb/couch.log).

Architecture

Query Servers

The CouchDB server usually evaluates map and reduce functions only when a design document containing those functions is queried by a client, by following this process:

  • If the query server configured in local.ini is not already running, start it.
  • Send various instructions on the query server’s STDIN, such as “apply the map function F to document D”
  • Read the results on the query server’s STDOUT.

The Adapter Server

The adapter server provided by this project is one such query server. When it must apply a function to a document, it does the following:

  • Determine which application provides the function.
  • If the application is not already running, start it.
  • Send the request to the application’s STDIN, read the answer on its STDOUT.
  • If the application responds with results, send these back to CouchDB.

In short, the overall architecture looks like this:

+---------+         +------------------------+
|         | <-----> |  Haskell Query Server  |
|         |         +------------------------+
|         |
|         |         +------------------------+
| CouchDB | <-----> | Brainfuck Query Server |
|         |         +------------------------+
|         |
|         |         +------------------------+
|         | <-----> |                        | <-----> [ Application /home/nicollet/test ]
+---------+         |  OCaml Adapter Server  |
                    |                        | <-----> [ Application /usr/bin/foo ]
                    +------------------------+

The programmer should therefore write an application which reads the adapter requests on STDIN, runs the requested functions on the provided documents, and sends the results back on STDOUT. All the boilerplate involved is handled by the CouchAdapter module, so that the actual development process you will be following is:

  • Include any modules you might need to use in your view.
  • Define the map or reduce function as an OCaml function.
  • Register that function as being exported with CouchAdapter.export_map and CouchAdapter.export_reduce.
  • Call CouchAdapter.export()

Importing From CouchDB

CouchDB references map and reduce functions in design documents, using the following syntax:

{ "_id" : "_design/..."  ,
  "language" : "...",
  "views" : {
    "foobar" : { "map" : ... }
    "quxbaz" : { "map" : ... , "reduce" : ... }
  }
}

In order to use the OCaml adapter, one must first set the language property to "ocaml". Then, to reference the function "extract_foo" defined in application /usr/bin/foo, one would write:

"views" : {
  "foobar" : { "map" : ["/usr/bin/foo", 1, "extract_foo"] }
}

The same syntax applies for reduce functions as well. The three components of the definition are 1- the absolute path to the application that exports the function (this is how the adapter server knows what application to run), 2- a version number discussed in the next section and 3- the name under which the function is exported from that application.

Function versions

For performance reasons, once an application or query server has been started, it is never shut down. This only causes problems when there’s a new version of the code that needs to be deployed. The adapter server provides a versioning system which automatically detects that a function.

A CouchDB design document requests a function that is at least a certain version. For instance, ["/usr/bin/foo", 42, "extract_foo"] indicates that the adapter server should find version 42 or greater of the function "extract_foo" exported by application usr/bin/foo. If that application is currently running and the function is either missing or older than version 42 then the application is shut down and started anew in a completely transparent fashion.

Note that if rebooting the application still fails to provide an appropriate version of the function, the adapter server will report an error, which CouchDB will propagate to the client. This makes all the views inside the design document unavailable until an appropriate version of the application is deployed.

Failing to manage function versions both in CouchDB and in the application can lead to data inconsistencies, as different documents are processed by different versions of the same function. Only a global version change which prompts a full refresh of the view and reloads the application can ensure data consistency in the face of code changes.

Creating a map function

A map function must follow the signature json -> (json * json) list: the argument is the entire document being processed, and the output is a list of key, value pairs being output by the map function.

For example, suppose you already have an User module in your application, which is used among other things for reading and writing users to the CouchDB database:

type t = {
  active  : bool ;
  name    : string ;
  email   : string ;
  picture : string
}

let of_json = (* ... *)
let to_json = (* ... *)

Then you can rely on that module to define a map function with the above signature, and export it using the CouchAdapter module:

open Json_type

let user_by_email json =
  try let user = User.of_json json in
      [ String user.User.email , Null ]
  with _ -> []

let () =             

  CouchAdapter.export_map
    ~name:"user_by_email"
    ~version:1
    ~body:user_by_email ;

  CouchAdapter.export ()

Should you decide to update the view code, make sure that you also increment the version number:

open Json_type

let user_by_email json =
  try let user = User.of_json json in
      if user.User.active then [ String user.User.email , Null ]
  else []
  with _ -> []

let () =             

  CouchAdapter.export_map
    ~name:"user_by_email"
    ~version:2
    ~body:user_by_email ;

  CouchAdapter.export ()

Creating a reduce function

There is no distinction made between reduce and rereduce. While this causes a slight loss in functionality it also makes writing reduce functions less arduous given the OCaml type system. The signature of reduce functions is simply json list -> json.

For example, let’s assume that an Article module is already defined in your main application:

type t = {
  title : string ;
  html  : string ;
  tags  : string list
}

let of_json = (* ... *)
let to_json = (* ... *)

We now define a map function and a reduce function that counts how many articles are published for every tag.

let by_tag_map json =
  try let article = Article.of_json json in
      List.map (fun tag -> String tag , Int 1) article.Article.tags
  with _ -> []

let by_tag_reduce json =
  Int (List.fold_left (fun acc -> function Int i -> acc + i | _ -> acc) 0 json)

let () =
  CouchAdapter.export_map "by_tag-map" 1 by_tag_map ;
  CouchAdapter.export_reduce "by_tag-reduce" 1 by_tag_reduce ;
  CouchAdapter.export ()

And the CouchDB design document is as follows:

{ "_id" : "_design/article",
  "language" : "ocaml",
  "views" : {
    "by_tag" : { "map"    : ["/path/to/app", 1, "by_tag-map"    ],
                 "reduce" : ["/path/to/app", 1, "by_tag-reduce" ] }
  }
}

Article image © Miriam Rossignoli — Flickr

Short OCaml Diffs

I’m working on some wiki functionality right now, and one of the basic properties of a wiki is that history is kept for all pages. And this should be done elegantly, without actually keeping a copy of every version, especially since many changes are actually just small fixes. This is usually done by keeping a diff — a small piece of data that stores the changes needed to get from version A to version B. Computing an arbitrary diff is a fairly easy task, but computing a diff that is as small as possible is actually challenging, because there are many ways in which a diff can be created and only a few of these are optimal.

The standard approach to computing diffs is to find the longest common subsequence — a non-contiguous sequence of characters that is found in both versions, such that one may turn version A into the sequence only by removing characters, and then turn that sequence into version B only by adding characters. The list of additions and removals is what is commonly referred to as a diff, because this is how the UNIX program diff works.

I am not happy with this solution. For one, it works based on lines, which is fairly acceptable for source code that is cleanly split into many lines, but the average wiki page is not as clean. Instead, every paragraph tends to stay on its own line and is only word-wrapped for convenience. Using a line-based diff tool on such data would cause the removal and addition of entire paragraphs, which is certainly not optimal if the change was a minor typo fix on a single word. And since the diff algorithm itself is quadratic, it cannot be coaxed to work on characters instead of lines : a one-millisecond line-based run would take ten seconds on characters !

Another shortfall of the longest common subsequence approach is that it fails to detect sentence or paragraph swaps, yet these happen quite frequently in the rewriting stages of a wiki page. Indeed, regardless of how clever the diff algorithm is, the diff format itself provides no primitives for moving data around, only “add this” and “remove that” are supported.

I have designed and implemented a custom diff algorithm. Instead of working based on “add” and “remove” primitives, it carries a data segment where all the new content is stored, and works with “blit from old” and “blit from data segment” instructions to construct the new version from the old. Its format is optimized to be stored as JSON, and the diff application algorithm is simple enough to be handled by a short piece of JavaScript if necessary. It’s on GitHub, by the way.

For instance, consider the sentences “The quick brown fox” and “Brown foxes are quick“, and how the former can be transformed into the latter.

A longest common subsequence approach (on characters, assuming this is sane) would determine that the LCS is the seven characters “e quick“, and so it would be written as “ThBrown foxes are quick brown fox” which is indeed the shortest diff you can get using the LCS approach.

My algorithm instead determines that contiguous substrings “rown fox” and “e quick” are present in both sentences, so the diff would be written as “Brown foxes are quick“. The corresponding JSON is quite short, too:

["Bes ar",1,[10,8],5,[-12,7]]

The first element is the data segment (the five new characters), single integers are blit-from-data-segment (length) instructions, and integer pairs are blit-from-old (offset,length) instructions.

The theoretical algorithm complexity is O(m+n), the actual implementation on GitHub is worst-case O(mn) because I used lists instead of hash tables at one point, but should actually be O(m+n) on any input that happens to be written in a human language because working on a human language almost guarantees that those lists will contain zero or one elements in most situations.

The algorithm itself relies on statistical properties of character pairs to quickly match up sequences from both strings, such as by finding out that the ox pair only appears once in each sentence, or that the Br pair only appears in the new sentence. On medium-sized texts (up to 50,000 characters), there are several character pairs that are unique or missing, which helps the algorithm guess correctly where each piece came from. Once the source of the pieces is identified, generating the diff is extremely simple.



1150 feed subscribers
(readers who polled a feed this week)