Tag Archive for 'Language Design'

Touchscreen Development

A touchscreen is a variation on the traditional screen-and-mouse combination for input-output. One of its main benefits is that aiming is easier when you can just point at things with your finger, which means interaction can be faster. One of its disadvantages is that there is no scale adjustment (you can move a mouse by one inch to move the cursor by a half inch, but one inch of your finger is always one inch of your touchscreen), which leads to increasing the size of buttons, and a larger user interface means a smaller area for displaying back information.

I’ve been looking around for touchscreen-centered development environments. Kodu comes close, since it’s able to work on a television screen with an Xbox 360 controller and could conceivably be controlled through finger pointing. Aside from that, however, I haven’t really found things worth mentioning (but I can certainly have missed some).

One direction I would like to see explored in greater depth is the area of visual (or nearly visual programming languages). For instance, something like:

These screenshots contain code blocks, where each block has a darker tab that serves as its header. Some blocks are short one-liners that only contain elementary code, so they’re on the same line as their header, whereas larger block are drawn below their header. Header colors mean:

  • Dark blue: top-level definitions, these appear as part of the current module.
  • Orange: conditional blocks. All such blocks are examined in order, the first one that matches is evaluated.
  • Light blue: for-loops, these define a new integer variable that traverses a range of integers.
  • Violet: nested definitions, these define a new variable bound to a specified value, and run some code with that binding.

All other code, in the white blocks, is actual code that is evaluated. A block’s value depends on the block type, with most blocks evaluating to their last line, and loops evaluating to an imperative ‘unit’ type.

The point of using blocks is to make the various elements stand out more for interaction, so that they are easier to identify and easier to touch.  The system supports four interaction modes: short presses, drag-and-drop, area select, and long presses. It also handles three distinct modes: view, edit and refactor. A short press on a block does the default action for that block, which depends on the current mode, for instance:

  • In edit mode, touching a dark blue, light blue or violet header inserts the variable defined by that header into the currently edited code.
  • In refactor mode, touching a dark blue, light blue or violet header selects that header for refactoring (it highlights and counts all occurences, and proposes a contextual menu for moving and renaming).
  • In view mode, touching a dark blue, light blue or violet header expands the definition itself. Touching any white block collapses that block. Touching the header of a white block expands that block back, and touching an expanded header collapses that header.

Drag-and-drop can be use to move things around. In view mode, it scrolls the screen. In edit and refactor mode, it moves blocks and lines. Area selection merely selects several touched areas. the difference between area selection and drag-and-drop is that area selection begins outside any blocks. Long presses bring up contextual menus for the activated block.

Done right, I believe this could improve the productivity of programmers by moving them away from the keyboard: not only is the physical speed itself faster, but selecting from a list has proven faster and less error-prone than typing out a name. Besides, automatic naming schemes could be found, for instance adding special documentation comments that suggests names for the return values of a function or the instances of a class.

So, What’s Your Name?

I’ve been having a lot of name-related problems, lately. Not my own name, but the names of various things I have to work with. Nothing unexpected: in the computer world, we tend to use names everywhere.

Every time there are objects to be manipulated by programmers, there’s a way to give names to these objects. It could be the “id” attribute in XML, it could be the name of a file in a filesystem, it could be the name of a variable or member in a program, or it could be the hostname of a web site… A name is, in theory, a set of characters from a predetermined alphabet (that varies based on the application) which is bound to an entity of some sort. Once you hold the name, you can get the associated entity directly.

This theoretical definition leads to two distinct issues:

  • There’s only a small number of acceptable names around. Of course, generic names like “clxbpf8990″ can be used (and if you look at the ids generated by generate-id() in XSLT, this is exactly what happens) but the entire point of a name is to allow people to retrieve content based on the name. As far as the internet goes, there’s only a finite number of names any user can remember—and web users have become increasingly reliant on google and the recently added Firefox 3.0 address bar to find sites with non-obvious names. How do we handle the inevitable collisions?
  • A programmer in Australia has a global variable named Foo in a C program. A programmer in Sweden, working on a different C program, also has a variable named Foo. Should the two collide? What about machine names on local networks? User names on the same machine? Not all names are global, which means that the scope of names has to be determined, and ways of handling collisions between local scopes must be invented.

Yet, it’s not all about collision. The most important element of naming is the ability to find the data bound to a name. This is what directories are for.

The simplest form of directory is the phone directory: you have a name on the left, and a phone number on the right. The directory also happens to be sorted by ascending key order so that dichotomic search can be used to find a given name in logarithmic time (how clever—in the age of search engines, we tend to forget logarithmic search times existed since the days of dictionaries and encyclopedias). Such directories exist in the computer world. The simplest would be the classic user directory, accessed through LDAP (Lightweight Directory Access Protocol) or perhaps ActiveDirectory, but these are not the most frequently encountered by the common user. Another simple example is /etc/passwd:

username:!:100:100:office:/home/username:/usr/bin/sh

User name is on the left, various user-related information is to the right.

Domain Name System

Let’s crank the complexity lever up a little bit. What classic directory format do we use several times a day? The Domain Name System (which most of you know as DNS). Everyone sends out queries to a DNS server, asking for the IP associated with a name. That is, when you type “www.nicollet.net” in your browser, the browser needs to know what server to connect to. The problem is that looking for a server is done by routers, which are usually not very smart: they have routing tables based on masking the IP address (whether that address is IPv4 or IPv6) so they need that address to begin with. So, the browser instead resolves the domain name to an address. This involves looking at the local “hosts” file to see if a definition exists (for instance, “localhost” tends to be bound to the loopback address “127.0.0.1″), then queries any local name services the operating system may provide (this allows you to connect to an HTTP server running on another machine on your network by using that machine’s name, without having to set up a local DNS server or registering with a public one), and finally sends out a query to a DNS server somewhere on the internet (the address of the DNS server is either provided by the ISP itself, or manually entered by the user in a configuration wizard or file). The DNS then returns the address for the domain.

This is where two of my recent name problems came from. As you might remember, one week ago I had to move my blog from one server to another, and in the process, I had to change the DNS entry so that reads who typed in “www.nicollet.net” connected to the new server instead of the old one.

The first issue was that DNS propagation is not instantaneous. Back in the days when the intertubes were often clogged, caching played a big role in avoiding too many DNS queries moving up to the reference DNS server for a given top-level domain. The downside is that when you type in “www.nicollet.net”, you’re not really asking the reference DNS server for domains ending in “.net”, you’re asking the DNS server provided by your ISP, which may decide that its copy of the “www.nicollet.net” address binding is correct, without even asking the reference server (this is what caching is for). So, it took about five hours for all visitors to be correctly redirected to the new website. If any of you posted any comments, they went to the old website and were lost along with it—sorry. Of course, as you can expect, this can get a lot more problematic once you have a highly interactive website. So, you tend to choose an IP address and stick with it (or, if you have to change it, then you have it forward everything to the new one until nobody uses it anymore).

This also meant that any e-mails sent to foobar@nicollet.net were routed to the old MX binding, “mx1.ovh.net” (for those who wonder, a DNS entry contains several bindings: a web browser would look at the main binding for that domain, while a mail delivery program would look for the Mail eXchange binding) even though I was expecting them on the new MX binding, “nicollet.net”.  This took a short while to be sorted out.

The second issue was with the name of the server. See, being listed in a directory doesn’t change your name: it merely gives you a name by which you can be found. So, the name of the famous emperor of Wei is still Cao Cao, despite being posthumously named Wu. And the name of my new server still was r17474.ovh.net despite being now referred to as nicollet.net by the DNS. Then, of course, when some mail arrived for the user foobar@nicollet.net, my qmail-powered server quite naturally answered “there’s no user named foobar@nicollet.net here, please go away” in perhaps not so polite terms. So, I still had to convince my server that it was now known as “nicollet.net” so that the mail delivery could work. Nothing that the “hostname” UNIX command couldn’t solve.

What about local area networks? How come that you can access another machine on your network by using its name? There usually isn’t a central naming service which can be queried by computers on a LAN, and it wouldn’t be practical because it would have to be manually configured on every new computer. Instead, computers on a LAN use NetBIOS to set up local directories: whenever a new computer is hooked up to the network, it broadcasts its hostname and address to everyone else, and everyone else writes down the association in a local directory. Conflicts are resolved quite simply: if two computers have the same name, the last to broadcast is the one that everyone else remembers (this is quite useful if you have to move from an office to another and get a new IP as a consequence).

Names in Languages

Programming languages all provide the user with ways of naming entities that are significant, both for documentation purposes (it’s easier to understand what a value is when it has a reasonable name) and for cross-referencing data defined in one place and used in another.

On the one hand, you have the static compile-time approach. This is the easiest to work with by far, because all the nitty-gritty details are written down in the documentation of the compiler and the build scripts of your application. For instance, most compilers allow the definition of “include paths”, a list of locations on the filesystem where the definitions of objects can be found. When the compiler needs something (an included file in C or C++, a class in Java or Actionscript, a module in Objective Caml), it will look for an appropriately named file in all these locations.

Let’s consider the Objective Caml compiler, “ocamlc”. Whenever a source file contains a reference to another module (by using a symbol in a module context, such as “ModuleName.member” or “Functor(ModuleName)” or “open ModuleName” or something like that), the compiler looks for the compiled interface of that module. This is done by looking for the file “moduleName.cmi” (which is generated manually by running “ocamlc moduleName.mli”) in all the locations configured with the compiler: first, the current directory, then paths specified through the environment variables, then include paths specified with the -I command-line argument. If you’re only compiling the module (flag -c), the result is a cmo file (or cmx file with “ocamlopt”). At link-time, you must then specify all the cmo files required by the program, and the compiler resolves all links based on the name of the cmo file (for simplicity, it’s possible to group together several cmo files as a single cma file : they are simply concatenated together, so using the cma library is equivalent to adding all the cmo files it contains manually).

Java goes a short step beyond this by eliminating the need for a link step. When you import a java class, using the “import abc.def.Foobar;” statement, java understands that by looking at the relative path “abc/def/” it will find either a “Foobar.class” or a newer “Foobar.java” that it can compile to “Foobar.class”. Here, the path is specified relative to one of the include paths (which Java calls classpaths) and can be a normal filesystem path or a path within a JAR archive.

C and C++ are both the simplest and the most complex. In these languages, every reference is explicit. The first step is compilation, where files are included in other files by specifying the relative path. The second step is linking, where libraries and objects are included by specifying the relative path. At each step, additional locations can be specified to look for headers.

PHP has an interesting dual approach to things. On the one hand, its normal cross-file interpretation system consists in include() and require() statements which look for the named file in any of the specified include paths for PHP, which makes it look like SH, C and C++. On the other hand, PHP has also introduced functionality for resolving class names: when a class is used (“ClassName::Member”, “Foobar extends ClassName” or “new ClassName”) but the class is not defined, a special function is called. This lets the user specify an alternative loading scheme, such as looking for a file named ClassName.php and including it. The Zend Framework makes heavy use of this, meaning that the inclusion approach is specified once in index.php (or a common configuration file) and then no other inclusion is used for class filed (arguably, Zend_View still includes phtml files explicitly).

On the other hand, there’s runtime access. This is harder to work with, because there’s less documentation available. Sure, some concepts, such as dynamic linking, are fairly well-documented mechanisms that look for the named file in the current directory first, then in other directories specified as environment variables or system-wide configuration elements (in the case of java, in the classpath, for example). However, even then, some programs insist on working on their own.

I recently had an issue with Alfresco, namely alfresco-mmt.jar which terminated with a ClassDefNotFound exception. Looking for solutions on the internet, I found out that the class it was looking for was defined in a JAR nearby, so I adapted the classpath when running the jar so that the class could be found. Except it couldn’t. The problem with this was that the exception did not specify where the class loader had been looking for the class (or perhaps it did internally, but as an end user with no access to the source code or, probably, a debugger, I couldn’t see it), so I had no way of understanding where the JAR with the class should have been placed. It turns out, the application loaded the JAR from inside its own JAR, so I had to place it there.

Looking for Minima

Annealing is a process to harden metals: by repeatedly heating and freezing metal, atoms within the metallic structure are dislodged and then moved back into place, which moves those atoms that were not part of a regular structure to more correct positions, resulting in a stronger metal piece.

The underlying process is that of energy reduction: an atom in a non-regular position within a crystalline lattice has a high potential energy, but is kept into place because it has reached a local energy minimum (so moving away requires more energy than it has). By imcreasing the available energy, the atom is allowed to move out of the local minimum, and so it ends up in a location with lower energy (with a reasonable probability, though this is by no means certain). By applying this process a high enough number of times, the global minimum can be reached.

Simulated Annealing

In computer science, that process can be applied to minimization techniques: instead of exploring the problem space completely, annealing is simulating by moving around randomly in that space. In order to move towards minima, every move which increases the value can be canceled with a certain probability that increases as time passes.

For example:

let annealing steps energy initial =
  let rec aux n best best_energy current =
    if n = 0 then best else
      let current = random_move current in
      let new_energy = energy current in
      if new_energy < best_energy then
        aux (n-1) current new_energy current
      else if Random.int steps < n then
        aux (n-1) best best_energy best
      else
        aux (n-1) best best_energy current
  in aux steps initial (energy initial) initial

This is by no means the only possible approach: other variations can involve checking whether the move is upwards since the last position (instead of the best position so far) or using a different probability approach for preventing up-moves.

Genetic Computing

A common problem found in simulated annealing and genetic algorithms is the implementation of solutions as programs: since these approaches require the ability to randomly mutate a solution, and a program is usually a fickle and weak construct that can easily die if one of its parts is forcibly removed, representing programs as randomly altered entities is difficult.

One possible solution is to consider stack based-languages: these languages have very light syntax requirements and can be represented mostly as a sequence of words. Besides, the only way for a program in these languages to die is to underflow the stack (which can be easily solved by stating that reading from an empty stack yields a reasonable value, such as zero or one. My concept was to use a language which reads a single integer variable ‘x’ and manipulates a stack ‘s’, and outputs a stack of numbers. These numbers are then multiplied together to yield the final result. The available instructions are:

let operations =
  [|
    "nop", (fun (s,_) -> s);
    "pop", (fun (s,_) -> !s) ;
    "dup", (fun (s,_) -> t s :: s);
    "div", (fun (s,_) -> (t s /. t !s) :: !!s);
    "two", (fun (s,_) -> 2. :: s);
    "mul", (fun (s,_) -> (t s *. t !s) :: !!s);
    "add", (fun (s,_) -> (t s +. t !s) :: !!s);
    "cos", (fun (s,_) -> cos (t s) :: !s);
    "sin", (fun (s,_) -> sin (t s) :: !s);
    "sub", (fun (s,_) -> (t s -. t !s) :: !!s);
    "min", (fun (s,_) -> min (t s) (t !s) :: !!s);
    "max", (fun (s,_) -> max (t s) (t !s) :: !!s);
    "avg", (fun (s,_) -> ((t s +. t !s) *. 0.5) :: !!s);
    "var", (fun (s,x) -> x :: s);
    "bra", (fun (s,_) -> (if t s > 0. then t !s else t !!s) :: !!!s);
    "zer", (fun (s,_) -> 0. :: s);
    "one", (fun (s,_) -> 1. :: s);
    "neg", (fun (s,_) -> -1. :: s);
    "opp", (fun (s,_) -> -. (t s) :: !s);
    "dp2", (fun (s,_) -> t s :: t s :: s);
  |]

I have altered the meaining of ‘!’ to represent the tail of the list (or, if the list is empty, the empty list), and defined ‘t’ as the top of the list (or 0. if the list is empty).

The evaluation algorithm is extremely straightforward:

let run p x =
  List.fold_left ( *. ) 1.
    (Array.fold_left (fun s (_,o) -> o (s,x)) [] p)

I then apply simulated annealing to such programs (using a program size of 10 instructions) for 10000 iterations. The energy function is simply the error between the value computed by the program and the expected value. For instance, with the function ‘fun x -> x’, simulated annealing gives me the program:

cos dup pop nop bra cos dp2 nop var nop

Which can be rewritten as:

let t = cos (if 0 > (cos 0) then 0 else 0) in t *. t *. t *. x

This is a fairly convoluted way of saying “x”. Another example, with the function ‘fun x -> x +. x *. x’:

max bra one dup nop max add var add var

The infix form here would be:

((if max 0 0 > 0 then 0 else 0) + (max 1 1) + x) * x

Which is just ‘(1+x)*x’.

An Exercise in Terseness

Terse: smoothly elegant, devoid of superfluity

Programming languages have cultures that revolve around them, which are both a cause and a consequence of the idioms and techniques found in these languages. Some languages, such as Perl or Objective Caml, have small communities around them that lead to a reasonable consistency in their cultural aspect: while both value terseness as a sign of good design, Perl hackers rely on the wild tentacled expressiveness of the language while Objective Caml hackers rely on streamlined mathematical representations that the language expresses concisely.

Other languages are larger, often due to their application in so many situations that a single language-minded community cannot contain them. JavaScript ranges from download-free-script online communities with non-portable low-quality code to elegant functional programming to code generated from Java, while C++ ranges from academic numerical computations inspired from earlier proceedings in the C language to modern reliance on standard library generic and functional programming to large-scale object-oriented development.

Tentacled Onion

When I say that the Perl language has a wild tentacled expressiveness, it is more of an admiring stance than a negative analysis. Consider, for instance, the “underscore” special variable $_ : in many situations, if no variable is used to assign the result of an expression, then that result is assigned to $_. For instance:

while(defined($_ = <>)) print $_; 

# Equivalent to : 

while(<>) print;

To the average Perl hacker, this concept is an illustration of a fundamental principle (as the standard documentation puts it, “this may seem like an odd thing to you, but you’ll use the construct in almost every Perl script you write”). In terms of language semantics, the behavior is pretty simple to explain: a program is a collection of operations which read values and output values, where the output of one operation is the input of another. This is true for almost all programming languages (with the exception of some languages, such as Prolog).

Writing a program is the description of what operations to run, and what data goes from which operation to which operation. Concise languages tend to describe operations by naming them, and make data flows implicit by various means. Perl takes a “don’t write what you don’t need to specify” that is shared by several other languages from SH to Ruby: when it’s obvious that a value has to be assigned to something, there’s a default assignment target that will keep the programmer happy most of the time (the $_ special variable). And, in general, if an acceptable default behavior exists, it is used.

Go Forth and Multiply

Forth makes data flow completely implicit, through slightly different means: instead of being stored in variables, data is pushed onto a stack. Every operation therefore reads the top values from the stack, and its output is pushed back on top of the stack when it ends. As such, a Forth program is a sequence of words, and every word is an operation.

This extends to “function” definitions, which don’t have to name their parameters and instead read directly from the stack:

: abs dup 0 < if negate then ;

This definition of “abs” duplicates the top element, pushes zero, then pops the two top elements and pushes a boolean determining whether the bottom was smaller than the top, then pops that boolean and, if true, pops the top and pushes its negation.

Of course, terseness and conciseness leads to shorter code for equivalent functionality, and shorter code usually leads to more difficulties in understanding. For instance, long word definitions are frowned upon in Forth because keeping track of the stack is hard (its implicit manipulation is harder to follow than the less concise but more explicit technique of naming values for later use). The same goes for Perl, where the implicit use of a default behavior requires knowledge of what that default behavior is (and even the most war-worn Perl hacker can be surprised by the behavior of an exceptionally devious and elegant piece of code).

Paths, Expressions and Domains

One area where conciseness does not create difficulties in understanding is domain-specific languages. If a general-purpose language can describe a shopping cart in thirty bytes, there’s a high probability that those thirty bytes will be cleverly written and therefore difficult to understand. This is an elementary cookie-based shopping cart implementation in Perl:

@P=sub{lijs while<CKE>;}ul$<chomp; sub{~"http://.*";print}

Just kidding. On the other hand, a thirty-byte description of a shopping cart in a scripting language designed for creating e-commerce web sites is boringly unsurprising: this is what the language was designed for, and one can expect it to get to the point fast enough. Everyone who uses such a language already knows what a shopping cart is and therefore has no difficulties understanding the code.

XPath is an example of a concise yet understandable language: it has a clean and small domain (identification of nodesets within an XML document) and a syntax that matches. For instance, were I to look for all the ‘li’ elements of an XML document that are inside a ‘div’ named “menu”:

 /descendant::div[attribute::class="menu"]/descendant::li

Or, as a shorter version (using the shorthand syntax of XPath):

//div[class="menu"]//li

The equivalent search without XPath in any general-purpose language (even with a good library for traversing XML) would be much longer than that. The same happens for regular expressions: a complex language for the small domain of matching patterns within text (and possibly replacing them with something else), that aims to be concise above all.

An Exercise in Terseness

I have often been frustrated by the verbosity of XSLT as a tree transform language. Looking at the w3schools tutorials, for example, one can see an example such as:

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> 

<xsl:template match="/">
  <html>
  <body>
    <h2>My CD Collection</h2>
    <table border="1">
      <tr bgcolor="#9acd32">
        <th>Title</th>
        <th>Artist</th>
      </tr>
      <xsl:for-each select="catalog/cd">
      <tr>
        <td><xsl:value-of select="title"/></td>
        <td><xsl:value-of select="artist"/></td>
      </tr>
      </xsl:for-each>
    </table>
  </body>
  </html>
</xsl:template> 

</xsl:stylesheet>

This is, at least in my opinion, a mighty long piece of code for such a small thing. Using a stack-based approach similar to Forth, one could make the above look like:

: ~/
  "My CD Collection" /h2
    "Title" "Artist" . ( /th ) "#9acd32" @bgcolor /tr
    ~catalog/cd { $ ~title # ~artist # . ( /td ) /tr } .
  "1" @border /table
/body /html ;

The semantics of this language use two elements: a stack of output node sets and a stack of input node sets. It still works based on words:

Word Semantics XSLT Equivalent
“foo”
Pushes the text node “foo” on top of the output stack. Raw text
/foo
Pops the top of the output stack and replaces it with an element named “foo” containing all the elements that were in the popped node set. <foo></foo>
@foo
Pops the top of the output stack, then adds an attribute named “foo” containing the text representation of all the elements in the popped node set o the new top of the output stack. foo=
~foo
Pops the top of the input stack and pushes any of its contents that match the XPath “foo”. select=”"
$
Duplicate the top of the input stack. Automatic
.
Concatenate the top two output node sets into one. Automatic
{ }
Pops the top of the input stack, pushes an empty node set onto the output stack. For every node in the popped node set, pushes that node on the input stack and executes the code in brackets, adding an implicit . at the end of that code. xsl:for-each
( ) Same as { }, but pops from the output stack instead. XSLT 2.0 only.
# Pops the top of the input stack and pushes its text value to the top of the output stack. xsl:value-of
: ~foo ; Defines a template that matches the pattern “foo”. xsl:template
$: Pops the top of the input stack, pushes an empty node set onto the output stack. Then, pushes every element of the popped node set that matches an existing template onto the input stack and runs that template, then concatenates the top two output node sets. xsl:apply-templates

This shortens the resulting transform script, while reducing its readability (now, the generated XML is not “obviously present” in the script, but rather implied by the sequence of words that construct it).

If you remember the first example from this earlier article, you could rewrite it as such:

: ~/    $: /ul ;
: ~file ~@name # /p /li ;
: ~dir  $ ~@name # /p $: /ul . /li ;

When you’re careful, you can make out the meaning of this. However, it’s harder than with the original:

<xsl:template match="dir">
  <li>
    <p><xsl:value-of select="@name"/></p>
    <ul><xsl:apply-templates/></ul>
  </li>
</xsl:template> 

<xsl:template match="file">
  <li><p><xsl:value-of select="@name"/></p></li>
</xsl:template> 

<xsl:template match="/">
  <ul><xsl:apply-templates/></ul>
</xsl:template>

Parsing the language is extremely easy: a word is either a string literal or a sequence of non-whitespace characters, a template is defined as : ~foo … ; and a program is a sequence of templates. Of course, extracting the loops and conditionals may require some additional effort, but it remains extremely easy to get done. Writing a virtual machine is equally easy as long as XML extraction and manipulation functions for node sets are available. Even static analysis (of stack depth) is quite simple because of the requirement that every template and loop pushes and pops a fixed number of elements from the stack : a linear-time left-to-right traversal can determine whether the number of elements pushed and popped corresponds to the requirements, and generate code that runs the program. If the program outputs the file directly, an optimization pass can be made with a slightly increased difficulty (involving, mostly, executing the individual templates with slightly different semantics) in order to generate the output file linearly.

(Meta) language, Part 3

Specification attempt

This article attempts to chalk up a specification of the main syntax elements of the (meta)language being designed. The language codename chosen here is Xed.

Xed is an expression language (meaning that all syntax constructs in the language are expressions). It’s also a pure functional language, and although some expressions may appear to behave as side-effects for readability purposes they are ultimately implemented as pure functional operations.

Every expression in the language executes in a context. That context is composed of a table of local variables (with the associated values), including the special values _ and $, a list of persistent entities and relationships and their initial values (as read from persistent storage when the script is executed) and a table of modifications applied to those persistent entities (commited to the persistent storage when the script finishes executing).

Value types

Xed can manipulate several types of objects :

  • Base scalar types (integers, booleans), support the common arithmetic and boolean operators.
  • Base vector types (strings, streams), support size-of, access, shift/push, traversal and concatenation.
  • Base associative types (hashes), support size-of, access, set/unset, traversal and merge. Setting a new binding for a value hides the old one instead of replacing it.
  • Objects, mostly equivalent to hashes except that the types of values that are associated are not restricted to a single type.
  • Variants, equivalent to the polymorphic variant type in Objective Caml.
  • Persistent data types, used when defining relationships.
  • Closures (including patterns).
  • Names.

Syntax and semantics

The syntax, with its associated semantics, of all relevant elements are detailed below.

Object semantics

The most fundamental semantics are those of objects. An object is an association that binds arbitrary values to names. Adding a new binding merely hides the existing binding for that name, so that removing a binding restores the previous one.

  • [obj.empty]
    The empty object : { } is an expression that creates an empty object.
  • [obj.make]
    Create an object by defining its members, using shorthand notation. Uses the form { name = expr; … } to bind an expression to each member. This form is best used when defining objects that contain simple values.
  • [obj.add.short]
    Add members to an object using shorthand notation. Uses the form expr .+ { name = expr ; … } to bind an expression to each member. This form is best used when extending objects that contain simple values.
  • [obj.add.def]
    Add a member to an object using the definition syntax. Uses the form expr def name args = expr end to bind the second expression to the provided name in the object created by the first expression. If arguments are present, defines a closure. In the second expression, the special variable $ refers to the object created by the first expression.
  • [obj.add.rec]
    Add a member to an object using the recursive definition syntax. Uses the form expr def rec name args = expr and name args = expr … end to add all the specified members to the object defined by the first expression. The special variable $ refers to the object being created, and not the old one.
  • [obj.undef]
    Un-define a member of an object. Creates a new object where that member does not exist anymore. If an earlier definition was hidden by the undefined member, that definition becomes available again. Uses the syntax expr undef name, or the abbreviated syntax expr .- name. Causes an error if no such definition exists. This approach is used to implement private members and data hiding.
  • [obj.undef.all]
    Un-define a member of an object completely. Creates a new object where that does not have that member anymore (effectively removes the member and all the members hidden by it). Uses the syntax expr undef-all name.
  • [obj.ext.add]
    Add an object as a member of another object. expr object name = <text> end-object is equivalent to expr def name = { $ = $ } <text> end. This syntax is an extension that is not available by default.
  • [obj.mem]
    Extracts the selected member, using the syntax expr.name. Generates an error if the member does not exist.

Closure semantics

Closures carry with them state information extracted from the context. They do not, however, keep the list of modifications to the persistent state, which is provided to them when they are called and not when they are created. All local variables referring to the above context are carried over.

  • [fun.make]
    Create a new closure, using the syntax { args -> expr }. There is no limit on the number of arguments, but pattern matching is not allowed.
  • [fun.match]
    Create a new closure, using the syntax { pattern -> expr | pattern -> expr }. This function accepts a single argument, but that argument is pattern-matched. Patterns may be either variant type patterns or regular expression patterns, which bind variables that are available in the expressions.
  • [fun.call]
    A function is called on an argument using the syntax expr expr.

Pipe semantics

A lot of operations are performed using pipes, which serve as a shorthand notation for many elementary operations.

  • [pipe.call]
    A pipe can be used to call a function on an argument. For instance, a |> f is rewritten to f a. The exception is if f contains the special variable _, at which point it is rewritten as if it were equivalent to { _ -> f } a. This specificity of the right-hand operator is called the implicit lambda rule.
  • [pipe.map]
    The expression s |: f, where s is a stream or hash, creates a new stream or hash (the same as the input) by applying f to every element of s. This expression is subject to the implicit lambda rule. If the input was a hash, the key-element association is unchanged.
  • [pipe.filter]
    The expression s |? f keeps only the elements of the input stream or hash such that f x is true. This expression is subject to the implicit lambda rule. The key-element association of hashes is kept.
  • [pipe.hash.filter]
    The expression s |# f filters a hash based on its keys instead of its values.

Note that the combination of the pipe.call operation and the fun.math operation allows an operation equivalent to the match-with or switch-case pattern-matching constructs.

Name, Variant, Exception semantics

The universal variant type works on the same basis as Objective Caml polymorphic variants. Exceptions are thrown and caught based on their name, and they may carry associated values. They are ultimately members of the variant type, and so they do not have to be defined.

  • [name.fresh]
    Creates a fresh name, different from all the other names in the program. The syntax is simply fresh.
  • [name.lit]
    Creates a name literal. The syntax is `name`
  • [name.use]
    In any situation where a (syntactical) name is expected and a (semantical) name is available, one can use the approach (name) to instantiate it. For instance, obj undef (name) or `new[(name)]` or even `(name)`.
  • [var.make]
    A new member of a variant type is created with expr::name, where the name can be chosen on the spot (no definition required). This may be used in pattern-matching expressions. Note that in orderto support variants bound to an unit value, a cleaner syntax is :name (equivalent to ()::name).
  • [exn.throw]
    Throws a variant object as an exception. The syntax is merely throw expr. The current execution context is completely lost (with the exception of any values referenced by the thrown expression), meaning in particular that any modifications applied to the persistent data model are thrown away.
  • [exn.catch]
    Catches an exception if it matches a certain variant name, using the syntax try expr catch { pattern -> expr | … }. If an exception is thrown by the first expression, then pattern-matching finds the corresponding expression. If one is found, it’s evaluated in the context of the try-catch expression. Otherwise, the exception is propagated upwards.

Note that the language is fully exception-safe: when an exception is thrown, no trace of the execution of the throwing code is left behind, except whatever is saved by the exception itself. This means that any side-effects are automatically rolled back, removing the need for a “finally” clause. Another important element is the fact that all exceptions must be caught before the escape the script (meaning no script will be interrupted by an unexpected exception).

Pattern semantics

Patterns are functions which map objects to objects. Such functions can be created and called by basic means, but the following syntactic sugar is available:

  • [pattern.ext.def]
    Defines a pattern as a member of an object. The expression is expr pattern name args = <text> end-pattern is equivalent to expr def name args $ = $ <text> end. This syntax is an extension that is not available by default.
  • [pattern.ext.use]
    Uses a pattern of an object. The expression is expr use-pattern name args and is equivalent to expr |> name args. This syntax is an extension that is not available by default.

Operational Semantics

Formal language semantics describe how a language should behave. Most of the time, semantics define the values manipulated by the program, as well as the state of the program, and describe how the various expressions and statements change the state and create values. When designing a virtual machine for a language, translating the formal language semantics into code is usually the easiest way to implement the language. Elementary semantics are quite easy to grasp, so let’s start with an example:

[|literal|] → literal 
[|X + Y|]   plus  ([|X|],[|Y|])
[|X - Y|]minus ([|X|],[|Y|])

These are operational semantics for an elementary language that computes sums and differences of numbers. The color code I am using is bold green for expressions being evaluated, orange for virtual machine operations and black for values. Operational semantics exmplain how to evaluate an expression by evaluating its sub-expressions and then applying an operation to their values. Applying the semantics above to the expression 1 + 2 – 3 yields:

[|1 + 2 - 3|] minus ([|1 + 2|],[|3|])
→ minus (plus ([|1|],[|2|]), 3 )
→ minus (plus ( 1 , 2 ), 3 )
→ minus ( 3 , 3 ) → 0

For Objective Caml readers, some example code equivalent to the above would be:

let rec eval = function
  | `literal x -> x
  | `add(x,y)  -> eval x + eval y
  | `sub(x,y)  -> eval x - eval y 

eval (`sub(`add(`literal 1,`literal 2), `literal 3)

The main point of using operational semantics to describe the formal semantics of a program language is that they are very concise, allowing definitions to take a streamlined form. The main difficulty appears when one attempts to introduce more complex behavior into the language, but this difficulty is solved by a multitude of techniques. I tend to give Objective Caml examples for several reasons. The main reason is that they are concise when expression tree traversal is involved, because of pattern-matching.

Context

Usually, the evaluation of a part of a program is not independent of the rest of the program. While simple computations (such as the above one) are standalone, any expression that involves a name will require context in order to resolve that name. For instance, if a program reads the value of user-defined variable FROBNICATE, it needs to know what that variable represents, and that information is extracted from the context. A basic example of context-based evaluation would be:

[|literal|]K  → literal 
[|variable|]K K[variable]
[|X + Y|]Kplus  ([|X|]K,[|Y|]K)
[|X - Y|]Kminus ([|X|]K,[|Y|]K)

I’m using red font to represent the context. The new expression is the variable, which when evaluated is replaced with whatever value is associated to its name in the evaluation context. So, evaluating TWO + 1 in the context K={TWO:2} yields:

[|TWO + 1|]{TWO:2}plus ([|TWO|]{TWO:2},[|1|]{TWO:2})
→ plus ({TWO:2}[TWO],1)
→ plus (2,1)
→ 3

Example Objective Caml code:

let rec eval k = function
  | `lit x     -> x
  | `var s    -> List.assoc s k
  | `add(x,y) -> eval k x + eval k y
  | `sub(x,y) -> eval k x - eval k y 

eval ["TWO",2] (`add(`var "TWO",`lit 1))

In short, a context is usually a set of bindings between names and values that every variable expression can draw from in order to resolve itself into a value. Of course, an evaluation problem happens whenever a variable does not exist in the context in which it is evaluated. Among the possibilities for resolving that situation is using static analysis to make sure that every variable is defined before being used (this is what compiled languages do), deciding that if a variable is not bound in a context, then its value is null (this is what most scripting languages do), or reporting an error (this is what LaTeX and some other shell scripting languages do).

Once context-reading semantics are in place, it can be interesting to create context-writing semantics. One such example is the let-in binding expression of ML family languages, which adds the following rule to the previous four:

[|let name = X in Y|]K[|Y]]K' where K' = K + {name:[|X|]K}

This expression evaluates its first operand in the current context, then creates a new context by binding the name to that value (and keeping all the bindings of the current context), then evaluates its second operand in that new context and returns its value. Note that the original context is not modified, so other expressions don’t have access to the new binding. For instance:

[|let x = 1 in let y = x + 2 in x + y|]{}[|let y = x+2 in x + y]]K where K = {} + {x:[|1|]{}}
→ [|let y = x+2 in x + y]]{x:1}[|x + y|]K where K = {x:1} + {y:[|x+2|]{x:1}}
→ [|x + y|]K where K = {x:1} + {y:plus ([|x|]{x:1}, [|2|]{x:1})}
→ [|x + y|]K where K = {x:1} + {y:plus ({x:1}[x], 2)}
→ [|x + y|]K where K = {x:1} + {y:plus (1, 2)}
→ [|x + y|]{x:1,y:3}plus ([|x|]{x:1,y:3}, [|y|]{x:1,y:3})
→ plus ({x:1,y:3}[x], {x:1,y:3}[y])
→ plus (1,3)
→ 4

Example Objective Caml code for that evaluator:

let rec eval k = function
  | `lit x      -> x
  | `var s      -> List.assoc s k
  | `add(x,y)   -> eval k x + eval k y
  | `sub(x,y)   -> eval k x - eval k y
  | `def(s,x,y) -> eval ( (s, eval k x) :: k ) y

Stacks and de Bruijn Indices

Comparing strings to extract a variable from a context is often a bad idea, performance-wise. This is sometimes necessary, for instance when semantics allow runtime creation of variable names (such as $$name in PHP). In practice, when compile-time nested variable scopes exist (as outlined above) using de Bruijn indices and a stack instead of an associative container allow an increase in performance.

Whenever a variable is defined, it is pushed onto a stack. When the scope of that variable ends, it is popped off the stack. The de Bruijn index of a variable is the depth of that variable on the stack, which allows retrieving it. The de Bruijn index of any variable can be computed at compile-time through a very simple static analysis which stores a stack of variable names and replaces every name it encounters with the current depth of that name. Since the parser respects the same scoping rules as the interpreter, the correct association is made between variable names and stack depth.

An example of Objective Caml evaluator that uses de Bruijn indices, with implicit popping (but a performance penalty due to non-random access):

let eval k = function
  | `lit x    -> x
  | `var i    -> List.nth k i
  | `add(x,y) -> eval k x + eval k y
  | `sub(x,y) -> eval k x - eval k y
  | `def(x,y) -> eval ( eval k x :: k ) y

An alternative is to use a random-access container for the stack to access values in constant time instead of proportional to their depth. In general, de Bruijn indices are interesting both for writing interpreters and for performing static analysis in code. Ideally, one of the first steps performed by a parser is to transform variable names to de Bruijn indices. An interesting side-effect is that this also provides a clean algorithm to determine whether all variables are defined:

let rec pos x = function
  | [] -> raise Not_found
  | h::t ->
      if x = h then 0
      else 1 + pos x t 

let rec db s = function
  | `lit x      -> `lit x
  | `var n      -> `var(pos n s)
  | `def(n,x,y) -> `def(db s x,db (n::s) y)
  | `add(a,b)   -> `add(db s a,db s b)
  | `sub(a,b)   -> `sub(db s a,db s b)

Functions

Another core functionality of many languages is functions. The behavior of functions is usually classified in three ways. First, the macros are a syntactic shorthand which is expanded at compile-time instead of runtime. So, the macro which adds a number with itself would work as follows:

[|TWICE(X)|]K → [|X+X|]K

The problem, of course, is that the expression is repeated twice, and as such it will be evaluated twice. So, for instance:

[|TWICE(1+2)|]{}[|(1+2)+(1+2)|]{}
plus([|1+2|]{},[|1+2|]{})
→ plus(plus([|1|]{},[|2|]{},plus([|1|]{},[|2|]{})
→ plus(plus(1,2),plus(1,2))
→ plus(3,3)
→ 6

Here, the macro duplicated the subexpression, then the subexpression itself was evaluated once for each member. This happens, for instance, with C preprocessor macros which merely duplicate the code, instead of working with the value of the code. As a result, macros are powerful in terms of expressiveness (because they can rewrite code arbitrarily, even outside of expressions) but not in terms of performance (or, in fact, in terms of semantics when side-effects exist that only have to be executed once). Besides, it’s impossible to have recursive macros, since the expansion would be infinite.

The improved alternative to macros is functions. Unlike macros, which are syntactic, functions perform a similar substitution at runtime. To make things simpler, we’ll consider here a function with a single parameter. The semantic rule for a function call is:

[|F(X)|]K[|B|]K' where K' = {a:[|X|]K} and λa.B = [|F|]K

Now, let’s understand this. To evaluate a function-call expression, one first evaluates the function expression (F), which should result in a function object (λa.B). A function object, here, is a combination of a name (a) called the parameter and an expression (B) called the body.

Once these are known, the argument expression is evaluated (x), and its result is bound to the name of the parameter in a brand new context (K’). The body is then evaluated in that context. So, if we consider a function such as:

[|TWICE|]K → λx.x+x

Then the previous example would now evaluate as follows (for conciseness, I now skip the evaluation of simple arithmetic expressions):

[|TWICE(1+2)|]{}[|B|]K where K = {a:[|1+2|]{}} and λa.B = [|TWICE|]{}[|x+x|]{x:3}
→ 6

Objective Caml example:

let add = function
  | `Int x, `Int y -> `Int (x+y)
  | _              -> raise TypeError 

let sub = function
  | `Int x, `Int y -> `Int (x-y)
  | _              -> raise TypeError 

let rec eval s =
  | `var i     -> List.assoc i s
  | `lit i     -> `Int i
  | `twice     -> `Func ("x", `add (`var "x",`var "x"))
  | `add(a,b)  -> add (eval s a) (eval s b)
  | `sub(a,b)  -> sub (eval s a) (eval s b)
  | `call(f,x) ->
    match eval s f with
      | `Func (a,b) -> eval [a,eval s x] b
      | _           -> raise TypeError

However, in this example, the function name was bound to a value as part of the language semantics. How is it possible to define functions as part of the program? Several possibilities exist.

One approach, taken by many languages (including C and PHP) is to make functions a part of the context. When a function is called, its body is evaluated in a context which is composed of the global context (that is, the set of all the functions and variables that were defined at global scope) and the values of the parameters. The advantages to this are multiple:

  • The function now has access to every function and variable in the global scope. Of course, for single-pass compilation reasons, C requires that the variables and functions are declared before they are used (so that the compiler doesn’t have to backtrack once a definition is found) but both functions and variables can be defined anywhere.
  • Since a function only references its own local data, plus the data in the global scope, one does not have to care about the lifetime of normal values: either the value is global (it remains around for the entire duration of the program) or it is local (it disappears when the function returns).

So, the new rule in that situation is :

[|F(X)|]K[|B|]K' where K' = GLOBALS + {a:[|X|]K} and λa.B = [|F|]K

The problem that this introduces is that the global context has to be filled, and normal expressions cannot be used to fill it because they might contain non-global variables introduces with the let-in construct. Programming languages tend to define a syntax for non-local scope definitions that is different from the normal syntax at local scope. C, for instance, defines functions and globals at file scope, and allows statements and local definitions at function scope. C# and Java allow class definitions at namespace/package scope, method/variable definitions at class scope, and normal statements at method scope. Objective Caml allows global definitions at module scope and normal expressions everywhere else. Some dynamic languages, such as PHP, notoriously work around this issue by allowing runtime extension of the global context from anywhere (so functions can define other global functions).

Closures

Closures are an extension of the previous solution based on a global context. Instead of having all functions use the same global context as a starting point, closures allow each function to keep its own base context. As such, a closure associates a context with every function, something that will be expressed as λa.B{K}. The classic way of creating a closure is to use the context that was available at the point where the closure is created. So, the evaluation rules would be:

[|F(X)|]K[|B|]K" where K" = K' + {a:[|X|]K} and λa.B{K'} = [|F|]K
[|{a -> B}|]K → λa.B{K}

The ability to carry any context is both an advantage and a disadvantage. The advantage is that this adds to the expressive power of the language by allowing the creation of arbitrary functions in arbitrary locations, instead of having to rely solely on parameters and global variables. The disadvantage is that this implies local values can be carried out of their scope by a closure, which implies a garbage collection scheme or restrictions on what closures can do.

Consider a closure which squares a function:

{f -> {x -> f(f(x))}}

And consider the ‘twice’ function from above:

{x -> x + x}

An example of application would be:

[|{f -> {x -> f(f(x))}}({x -> x + x})(3)|]{}[|B|]K" where K" = K' + {a:[|3|]{}} and λa.B{K'} = [|{f -> {x -> f(f(x))}}({x -> x + x})|]{}[|B|]K" where K" = K' + {a:3} and λa.B{K'} = [|B'|]K""
          where K"" = K"' + {a':[|{x -> x + x}|]{}} and λa'.B'{K"'} = [|{f -> {x -> f(f(x))}}|]{}[|B|]K" where K" = K' + {a:3} and λa.B{K'} = [|{x -> f(f(x))}|]{f:λx.x+x{}}[|f(f(x))|]{f:λx.x+x{} ; x:3}[|B|]K" where K" = K' + {a:[|f(x)|]{f:λx.x+x{} ; x:3}} and λa.B{K'} = [|f|]{f:λx.x+x{} ; x:3}[|x+x|]K' where K' = {x:[|B|]K"'}
            where K"' = K" + {a:[|x|]{f:λx.x+x{} ; x:3}} and λa.B{K"} = [|f|]{f:λx.x+x{} ; x:3}[|x+x|]{x:6}
→ 12

A bit complex? Consider this in more simple terms: {f -> {x -> f(f(x))}}(twice) evaluates to {x -> twice(twice(x))}. Then, {x -> twice(twice(x)}(3)} evaluates to twice(twice(3)), which is twice(6), which is 12. The context allows the closure to remember that in the expression f(f(x)), f is twice.

This is, in essence, the nature of lambda-calculus: a language built around functions “λa.B” and applications “f(x)” such that “(λa.B)(x)” replaces all occurences of “a” in “B” by “x”. Here, we have written:

λf.(λx.f(f(x)))(λy.y+y)(3)
→ λx.(λy.y+y)((λy.y+y)(x))(3)
→ (λy.y+y)((λy.y+y)(3))
→ (λy.y+y)(3+3)
→ 6+6
→ 12

For those who are familiar with it, our evaluation technique is equivalent to lambda-calculus with the rightmost redex being reduced first.

Using this approach, it’s possible to use De Bruijn indices with the exact same technique as before: a closure expression merely introduces a new variable (the formal parameter) in its body expression, thereby increasing the stack depth by one. All the variables in the carried context are available using the same depth as if the closure was in fact merely a let-in binding. In fact, De Bruijn index actually originate from De Bruijn’s work on lambda-calculus.

So, another interpretation of a closure is that it is a let-in binding, but which still needs to be provided with the actual expression to be bound. That is: {a -> B} is like let a = (hole) in B, where the hole is filled by calling the resulting expression on an argument.

Recursion

Using the global context as an evaluation context for every function allowed recursive functions without any additional work: when the function wanted to call itself, it could find its own name in the global context (since the execution of the program only starts after all global functions have been defined).

This is not the case for closures, however: a closure does not necessarily have a name (it’s just an anonymous expression), and the only way to give it a name with the current tools is to use a let-in binding expression. This means, however, that the name of the closure is only available after the “in” keyword, whereas the definition itself (which must perform the recursive calls) is present before the “in” keyword.

So, the problem is that in order to perform a recursive call, a closure must be bound as part of the context in which its body is evaluates. Since the solution used by global-evaluation schemes (which is to place the closure in the global context, and evaluate the body in that global context) is not available (the context is defined when the closure itself is defined, which happens before the closure actually exists in order to be given a name). Therefore, the only way to introduce the closure into the context without having to alter the evaluation system is by making the function its own argument, using the following semantics:

[|let rec n = f in e|]K[|e|]K' where K' = K + {a:μ(n,a).B{K"}} and λa.B{K"} = [|f|]K
[|f(x)|]K[|B|]K"' where K"' = K' + {n:μ(n,a).B{K"} ; a:[|x|]K} if μ(n,a).B{K"} = [|f|]K

These semantics add a new type to the virtual machine: a recursive closure. When called on an argument, a recursive closure defines two values within its context: the argument value (as the parameter name) and itself (as the recursive name). Recursive closures are defined by means of the “let-rec-in” construct that turns a normal closure into a recursive closure. This expression requires special De Bruijn rules to be parsed, since it introduces a new variable (the name itself) to the left of the “in” variable, but only within closures.

Multiple, mutual recursion is an extension of this principle, left as an exercise to the reader. Also, the astute reader familiar with fix-point operators might have recognized the relationship between the form of recursive calls (f f x) with the Y fixpoint operator of lambda-calculus.

However, this approach involving an additional value are too complex to implement when a simpler solution is possible: the recursive definition is only required within the body of the closure itself. Therefore, passing the closure to itself should only happen within that body. And since the body is cleanly delimited at parse-time, adding those calls syntactically instead of semantically is very easy. As such, it is merely a question of replacing “let rec f = a in b” with “let f = (let f = {f -> a} in f(f)) in b” where all occurences of the variable “f” in a have been replaced with “f(f)“. Once this syntactic replacement is performed, the previous virtual machine (with no recursive closures) can be used.

State

So far, there have been two ways of communicating data within the expressions. The value descends from the leaves to the roots, while context moves from the roots to the leaves, with possibilities for converting a value into a context element (let-in) or a context element into a value (variables).

Side-effects have the annoying property of appearing in an expression and affecting the nature of another expression without being part of the value of the first expression. This is for instance the case where one closure increments a variable while another closure reads it: the return value of the first closure cannot (and should not, for purposes of encapsulation) be provided to the second closure.

Since neither values (by definition) nor context (which only moves into expressions and never out) can represent side-effects, a new concept must be introduced. This concept is known as state. Unlike both values and context, which always move in a single direction, state moves in both. Every expression, when evaluated, is provided with a state (known as its initial state), and its evaluation yields another state (known as its final state). What the state really is depends on the application: it could merely represent the mutable variables of the program, or it could also represent input-output.

For the purposes of this article, the state will consist here of only mutable variables. The exact representation used here is a dynamic array of values, indexed from zero to infinity. Two expressions are allowed to interact with the context using the following rules:

[|ref x|]K,S → i,S' + {i:v} where i=length(S') and v,S' = [|x|]K,S
[|!r|]K,SS'[i],S' where i,S' = [|r|]K,S
[|r:=x|]K,S → v,S" + {i:v} where i,S' = [|r|]K,S and v,S" = [|x|]K,S'   

The “ref x” expression adds a new value to the state, then returns a reference to that value (in practice, the index of that value within the state). The “!r” expression extracts the current value of the reference “r” from the state. The “r:=x” expression changes the value associated with the reference “r” in the state to the value of “x”.

A common pattern, once state is involved, is to keep in mind that every expression has to read and write the state, meaning that the state will be travelling as an argument and return value everywhere. In practice, it’s possible to represent both values and context using the state (this is how virtual machines programmed in imperative languages work), but for clarity it will not be done here.

The existence of state means that some expressions will be used solely for their impact on the state, and not for their value. This results in the creation of the sequence expression, represented in Objective Caml using a semicolon, and with the semantics:

[|a ; b|]K,S[|b|]K,S' where _,S' = [|a|]K,S   

(Meta) language, Part 2

Part one can be found here, and I strongly suggest reading it first. Among the modifications that will be encountered here, the most important is that modules are now called objects, to improve readability.

This part discusses using names as first-class elements, and the underlying type system.

Names as half-class citizens

In the vast majority of statically typed languages and compiled languages, names cannot be manipulated. By contrast, most dynamic languages handle names. After all, the very notion of anonymous function, in PHP, is handled by giving the function a name at runtime and then passing that name around. It’s perfectly sensible to write the code below, and its result will be the expected one.

function square($x) { return $x * $x; }
$func = 'square';
print $func(10);

The PHP language restricts this feature to a few areas: when calling a function ($name(args) syntax), accessing a member ($obj -> $name syntax), creating a class (new $name(args) syntax) and accessing a variable ($$name syntax). Other examples, such as function $name(args) {} definitions, are not supported (though they can be achieved using other reflection or evaluation features).

In practice, knowing the name of an existing object is not useful: pointers, references, closures, factories and member pointers are all superior alternatives to names when referencing an existing object.

Of course, this does not solve the issue of defining new objects by name.

The need

First, the considered meta-language provides the use-pattern construct, which applies a pattern to an object. A pattern is allowed to define new members in that object. Depending on the situation, it may or may not be useful to let the new members hide the existing members. Consider for instance a pattern that adds access verification to a function. The result would be that the function is now only callable with an additional key argument that allows identity checking for access verification. The pattern must therefore know the name of the function that must be hidden.

Example source code:

pattern AddAccessVerification name valid =
  exception AccessNotAllowed
  def (name) =
  { token ->
    if ! (valid token) then throw AccessNotAllowed ;
    $.(name)
  }
end-pattern

This example hides the member function with the provided name by requiring an additional argument. If the original function was called as $.func a b c, it is now called as $.func token a b c.

A second need is the ability to store arbitrary labels as part of objects without conflicting with existing labels. For instance, if one needs to add an URL value to an entity, then one needs to also redefine the new built-in function for that entity to also take into account the URL. To avoid having to propagate a new argument, the argument to that built-in function is an object with a predefined set of labels. The URL-adding pattern must therefore create a new label for that object:

object Page =
  entity page
  property content[page] : string
  def new(page) =
  { data ->
    new $.page (object) | $.content[_] <- data.content
  }
end-object 

pattern WithUrl ent label =
  property url[$.(ent)] : string
  def new((ent)) =
  { data ->
    new $.(ent) (data hide (label)) | $.content[_] <- data.(label)
  }
end-pattern 

object UrlPage extends Page =
  use-pattern WithUrl `page` `my_url`
  def createPage =
  { text url ->
    new $.page {{ content = text ; my_url = url }}
  }
end-object

The solution

As illustrated above, names are subject to two rules:

  • A symbol quoted with backticks becomes a name. For instance, `page` is the name page. Such a name can be manipulated as any other variable. Its type is merely ‘name‘.
  • A name variable can be used wherever a member name could be used : def name = … , … hide name, new(name) …, property name = …, … entity name, pattern name = …, object name = …, value.name, {{ name = value ; … }} and so on. It must be escaped with brackets. So, for instance, def xyzzy = { 10 } is equivalent to def (`xyzzy`) = { 10 }.

The main difficulty here is working with a type system that can withstand these issues correctly. In particular, consider code such as this:

object Problematic =
  entity url
  use-pattern WithUrl `url` `url`
end-object

Without any checks, this would attempt to apply pattern WithUrl, which adds a new member named url (thereby hiding the old url member), then attempts to access the member named url as if it were an entity (because that was the provided name for the wrapped entity). Therefore, this must be caught by the type system (otherwise, the language risks the same dangers as the C++ metaprogramming features : you don’t know if you’re wrong until you try it).

The proposed type system is constraint-based. It works by associating an atom to every expression, and then expresses a set of constraints on the atoms found within a definition (such as a function, pattern or object). These constraints are verified when the atoms already exist, but when some of the atoms are parameters, the constraints are instead kept and inserted into the constraint system of the location which provides those parameters.

So, the constraint set for the WithUrl pattern would look like this:

WithUrl : (ent:name) -> (label:name) -> ($:object) -> (result:object)
with 
| $.(ent) : entity 
  obj2 = $ def url
| obj2.url : property[$.(ent) -> string]
| obj2.(ent) : entity
  result = obj2 def new((ent)) 
| new result.(ent) : (data:object) -> instanceof(result.(ent))
  with 
  | data.(label) : string
  | new obj2.(ent) (data hide label) : instanceof(result.(ent))

And so, providing `url` as ent would result in the conflicting constaints that:

WithUrl : url -> name -> (obj:object) -> object
with 
  obj2 = obj def url
| obj2.url : property[obj.url -> string]
| obj2.url : entity

Readability issues

This constraint model is much less readable than the typical ML type, and is in fact getting much closer to the C type system (which works by describing the operations that can be applied to variables). There should be no surprise that such a type system must be complex in order to account for meta-programming. But what does the type system look like when working with non-meta code?

The constraint set for the resulting UrlPage is this:

UrlPage : object
with 
| $.page : entity
| $.content : property[$.page -> string]
| $.url : property[$.page -> string]
| new $.page : (data:object) -> instance-of($.page)
  with 
  | data.text : string
  | data.url : string

This is much more readable. Is this because of the absence of pattern, or because of the absence of names? Forgetting for a short while about names, let’s rewrite the WithUrl pattern:

pattern WithUrl =
  property url[t] : string
  def new(t) =
  { data -> new $.t (data hide url) | $.content[_] <- data.url }
end-pattern 

The constraint system for this pattern is:

WithUrl : ($:object) -> (result:object)
with
| $.t : entity
  result = $ def url def new(t)
| result.url : property[result.t -> string]
| new result.t : (data:object) -> instance-of(result.t)
  with
  | data.url : string
  | new $.t (data hide url) : instance-of(result.t)

While slightly more complex, this remains quite manageable when considering the complexity of patterns. Equivalent systems are either too lax (C++ templates), too restrictive (C# generics) or just as verbose (Objective Caml functors).

Introducing Name Constraints

The simple solution would be to add name equality constraints. That is, if ent was known not to equal `url` or any of the names used internally by the pattern, then the constraint system would be much simpler. For instance:

pattern WithUrl ent label =
  assert { ent <> `url` }
  property url[$.(ent)] : string
  def new((ent)) =
  { data ->
    new $.(ent) (data hide (label)) | $.content[_] <- data.(label)
  }
end-pattern 

The assertion protects the following definition of url of overwriting $.ent, thereby allowing a more compact constraint set:

WithUrl : (ent:name) -> (label:name) -> ($:object) -> (result:object)
with
| assert { ent <> `url`}
| $.(ent) : entity
  result = $ def url def new((ent))
| result.url : property[result.(ent) -> string]
| new result.(ent) : (data:object) -> instance-of(result.(ent))
  with
  | data.(label) : string
  | new $.(ent) (data hide (label)) : instance-of(result.(ent))

To be continued

This overview still does not have a concrete description of the actual constraint-solving algorithm (though the actual constraints can be trivially extracted from the expressions, they still need to be grouped, solved and verified). Another possible point of contention is the new-operator, which can be overloaded and thus may cause issues because it is not bound to an object but rather to an entity. A possibility is to remove the possibility to overload it altogether, and instead only allow it to be hidden.

Recursive Descent Parsers

Parsing and grammars

A grammar describes how to generate a sequence of tokens that satisfies certain rules. For instance, the grammar for a basic calculator would be:

e = 'integer' [Value]
  | e '+' e   [Add]
  | e '*' e   [Mul]
  | '(' e ')'

Parsing is the process of obtaining a derivation that generates a sequence of tokens from a grammar. For instance, a token sequence such as:

2 + 3 * 4

Can be created from the above grammar using one of the derivations:

Mul (Add (Value 2, Value 3), Value 4)
Add(Value 2, Mul(Value 3, Value 4))

Of course, the grammar used as an example is ambiguous (there are several possible derivations), and in the real world one prefers non-ambigous grammars (so that the result of the parsing is always unique). An example of non-ambiguous grammar for the above is a grammar that takes into account the differences in priority for addition and substraction. So:

base = 'integer' [Value]
     | '(' expr ')' 

add = mul
    | mul '+' add [Add]

mul = base
    | base '*' mul [Mul]

expr = add

Where the first derivation (which was incorrect as far as priorities are concerned) is now impossible, because an Add can only appear inside a Mul if it’s inside brackets.

Recursive Descent Parsers

The simplest way to construct a parser from a grammar is to use a recursive descent parser. Assuming that the input is tokenized (every integer and operator appears as a single string within a list of strings), then a parser that effectively recognizes the expressions above is:

exception Syntax 

type r =
  | Value of int
  | Add of r * r | Mul of r * r

let rec rule_base = function
  | [] -> raise Syntax
  | "(" :: t -> let e, t = rule_expr t in
                ( match t with ")" :: t -> raise Syntax | l -> e, l )
  | i :: t -> try Value (int_of_string i), t with _ -> raise Syntax
and rule_mul t =
  let a, t = rule_base t in
  match t with
    | "*" :: t -> let b, t = rule_mul t in Mul (a,b), t
    | _ -> a, t
and rule_add t =
  let a, t = rule_mul t in
  match t with
    | "+" :: t -> let b, t = rule_add t in Add (a,b), t
    | _ -> a, t
and rule_expr t = rule_add t 

let _ = rule_expr ["2";"+";"3";"*";"4"]

The parser correctly answers Add (Value 2, Mul (Value 3, Value 4)).

Handling Errors

A syntax error appears when the provided sequence does not follow the grammar, and there is therefore no possible derivation for that sequence. Syntax errors have three possible causes : a necessary token was forgotten (consider the sequence (2 + 3 * 4 where a closing brace is missing), a token was inserted where it doesn’t belong (consider the sequence 2 ++ 3 * 4 where a surnumerary plus was added) or a correct token was replacedwth an incorrect token (consider @ + 3 * 4 where the 2 was mistyped into an @). Most of the time, when a token that doesn’t fit is encountered, it’s difficult to determine whether that token is the error, or whether a token was forgotten before it (should 2 ++ 3 be 2 + 3 or 2 + 1 + 3?).

Because of this, most parsers merely indicate the problematic token where a syntax error has happened (this is what Objective Caml does) while others also add the list of tokens that could have been accepted (see for instance PHP’s infamous Paamayim Nekudotayim error message).

The problem is that naive Recursive Descent Parsers are quite bad at handling or detecting syntax errors as soon as alternatives are involved. Consider the grammar built on top of the previous one:

def = 'let' 'n' '=' def 'in' def [Def1]
    | expr                       [Def2]

prog = 'eof'                   [Prog1]
    | 'let' 'n' '=' def prog [Prog2]
    | def ';;' pro           [Prog3]

Consider now the sequence of tokens let n = 10 %, where the final token is obviously an error. The parser will explore the possible derivations in this order:

  • Prog1 fails (‘let’ is not ‘eof’).
  • Prog2 runs and reaches ‘prog’:
    • Prog1 fails (‘%’ is not ‘eof’)
    • Prog2 fails (‘%’ is not ‘let’)
    • Prog3 reaches ‘def’:
      • Def1 fails (‘%’ is not ‘let’)
      • Def2 fails (‘%’ does not exist in ‘expr’)
    • … so ‘prog’ fails.
  • … so Prog2 fails.
  • Prog3 reaches ‘def’:
    • Def1 fails (‘%’ is not ‘in’)
    • Def2 fails (‘let’ does not exist in ‘expr’)
  • … so Prog3 fails.

The last failure encountered was that ‘let’ is not a valid initial token, since it cannot generate ‘expr’. Displaying this error is of course nonsensical : the actual error is the final token, not the first one.

The solution is to store the furthest token that generated an error (in terms of distance from the first token in the token stream) along with the list of tokens that it was expected to be. In this case, the token errors would be:

  • Token 0 (‘let’) : expected ‘eof’, ‘(‘ or ‘value’
  • Token 3 (’10′) : expected ‘let’ or ‘(‘
  • Token 4 (‘%’) : expected ‘eof’, ‘let’, ‘(‘, ‘value’ or ‘in’

And so the error provided would be along the line of “Syntax error on token ‘%’ after token ’10′, expected ‘let’, ‘(‘, ‘in’, a value or the end of file”.

Another key idea to keep in mind is the fact that one can remember opening braces, and when a closing brace is expected, signal the position of the opening brace.This would help for a missing ‘=’ after a ‘let’, for instance.

A sample implementation

The implementation can be found here.

Its interface header would look like this:

(** The description of a set of tokens. *)
module type TOKENS =
  sig
    (** The available tokens. *)
    type token 

    (** Errors generated when tokens are expected but
        not found. *)
    type error 

    (** Printing an error *)
    val string_of_error : error -> string 

    (** The end-of-file token. *)
    val eof : token
  end 

(** A recursive descent parser over a set of tokens. *)
module RecursiveDescentParser :
  functor (Tokens : TOKENS) ->
    sig 

(* ------------------------------------------------------ *) 

      (** Description of a failure *)
      type failure 

      (** Show a failure. Internally uses 
          Tokens.string_of_error on involved errors. *)
      val pretty_print : failure -> string 

      (** The exception thrown when a failure occurs. *)
      exception ParseError of failure 

(* ------------------------------------------------------ *)       

      (** An expectation function. Returns true if the token
          was expected *)
      type expect = Tokens.token -> bool 

(* ------------------------------------------------------ *) 

      (** A parsing context. This is what your parsing rules
          will be passing around. *)
      type ctx 

      (** A rule : receives a context, returns whatever was
          extracted from the context along with the new 
          context (includes any possible failures or removed
          tokens. *)
      type 'a rule = ctx -> 'a * ctx 

(* ------------------------------------------------------ *) 

      (** Get the next token in a context. Does not alter
          the context. Return Tokens.eof if no more tokens
          are available. *)
      val next : ctx -> Tokens.token 

      (** Get a context minus its first token. An empty 
          context remains an empty context. The original 
          context is not altered. *)
      val shift : ctx -> ctx 

      (** Create a new context from a list of tokens *)
      val stream_of_list : Tokens.token list -> ctx 

(* ------------------------------------------------------ *) 

      (** Have the context remember a certain failure 
          (to improve reporting) before trying again to 
          parse something else. *)
      val add_failure : failure -> ctx -> ctx 

      (** Raises a [ParseError] exception bound to the
          provided context and error. *)
      val fail : Tokens.error -> ctx -> 'a 

      (** Raises a [ParseError] exception.
          [fail_close err old ctx] works like 
          [fail err ctx], but the error is also bound 
          to a former context [old]. This serves to reference
          the opening brace tied to a missing closing brace,
          or similar constructs. *)
      val fail_close : Tokens.error -> ctx -> ctx -> 'a 

      (** Either the expectation is satisfied by the next
          token, or call [fail]. *)
      val expect : expect * Tokens.error -> ctx -> ctx 

      (** As [expect], but uses [fail_close]. *)
      val expect_close : expect * Tokens.error -> ctx -> ctx -> ctx 

(* ------------------------------------------------------ *) 

      (** Given a rule, creates a parenthesized version which
          [expect]s a first token, then reads the rule, then 
          [expect_close]s a second token, then returns the data
          returned by the rule. *)
      val between :
        expect * Tokens.error -> 'a rule -> expect * Tokens.error -> 'a rule 

      (** Given a list of rules, tries every one of them in
          order and returns the data of the first that did 
          not throw an exception. If all fail, also fails. *)
      val one_of : 'a rule list -> 'a rule 

      (** Reads as many instances of the rule from the input, until
          it fails, and returns the list in order. *)
      val many : 'a rule -> 'a list rule 

    end

The interface could certainly be improved (for instance, monadic manipulation of rules, or access to the internal representation) and so could the implementation (the manyfunction is not tail-recursive yet). Also, another of the classic issues of recursive descent parser (which is left-associative operators) is not solved.

(Meta) language, Part 1

What is this?

This article proposes a simple modeling language with a few syntactic and semantic tricks that allow using the language itself as a metalanguage. The language would be used to generate a persistent data model and several queries over that data model (both to read and to write). The underlying system for representing the data is the entity-property-binding architecture from this article.

Definitions and queries

The system defines entities, properties, bindings and queries. Queries, in turn, will involve subsets (filtering entity instances based on predicates involving their properties and bindings) and operations performed to all instances in these subsets.

Consider a simple data model for a library: an author has a name, a book has an author, a title and a publication date. Subsets which we would like to manipulate (beyond singletons) are the list of books of an author, the list of books with a certain title, the books published between two dates, and the authors with a certain name. Using the appropriate syntax:

entity author
property name[author] : string

entity book
property title[book] : string
property publication[book] : date

binding written_by[book : many book ; author : one author] 

def authorByName n =
  author |? name[_] = n
end

def booksByTitle t =
  book |? title[_] = n
end

def booksBetweenDates f l =
  book |? publication[_] > f |? publication[_] < l
end 

def booksByAuthor a =
  book |? written_by[book : _ ; author : a]
end

The syntax of defining entities, properties and bindings is pretty straightforward : between square brackets, the involved entities are mentioned with the appropriate multiplicity (for bindings) or value type (for properties). Subsets are a little bit more complex: they involve a simplified functional notation.

Every entity, when used in a subset expression, represents the set of all instances of that entity. This set is then filtered using the |? operator : list |? predicate is equivalent to the Objective Caml List.filter predicate list. This involves simplified functional notatioin for making anonymous function definition shorter. Here, name[_] = n is equivalent to (fun x -> name[x] = n). To determine which expressions belong to the anonymous function and which don’t, the rule implies only syntactic context. That is, the anonymous function starts as the placeholder argument _ and grows until it reaches a context where a function would be expected (such as the function side of the operator |? or any other functional operator).

Other functional operators:

x | f is equivalent to f x

x |: f is equivalent to List.map f x

x |! f is equivalent to List.iter f x ; x

exists x: f is equivalent to List.exists f x

forall x: f is equivalent to List.forall f x

In terms of the language, sets of entities are filtered and manipulated this way. The fundamental piece of behavior in this language is therefore chaining a value through several mutators.

A few example of queries (reading the titles and publication dates of books by an author, deleting an author, changing a book title and adding a new author):

def titlesOfBooksByAuthor a =
  booksByAuthor a |: title[_], publication[_]
end

def deleteAuthor a =
  booksByAuthor a |! delete ;
  delete a
end

def setBookTitle t b =
  title[b] <- t
end

def addNewAuthor n =
  new author | name[_] <- n
end

This introduces three new constructs: delete merely deletes its argument (an entity instance) from the data model and returns unit. property[instance] <- value sets the property of the selected instance to the specified value, then returns the instance (to allow chaining (as in the example of the add-new-author query which returns the author after setting its name). new entity creates a fresh instance of the entity, for which all properties must be set before they are read. Constraint violations (binding multiplicity, the fact that properties are used before being initialized) are signaled as runtime errors and there are attempts to detect them at compile time. Queries are transactional, meaning that an exception or runtime error rolls back the entire transaction.

Last but not least, let us introduce modules to handle all the namespacing issues, using the same convention that modules start with a capital letter and non-modules don’t (this is only a convention, not a lexical constraint). The chosen syntax is:

def Library =
module
  entity author
  property name[$.author] : string

  entity book
  property title[$.book] : string
  property publication[$.book] : date

  binding written_by[book : many $.book ; author : one $.author] 

  (* ... *)

end 

By default, the $ variable represents the current module (so I could create a function which takes an argument that has the same name as a module member, and use name to represent the parameter and $.name to represent the module member). Of course, this variable is optional, which means that when a variable name cannot be resolved, a ‘$.‘ is appended to see if it exists as a module member (should this fail, an error is launched.

Metaprogramming

In the current state, all objects are first-class citizens, including modules. This means that functions can be used to manipulate entities, properties, queries, modules, and so on. For instance, assuming that testToken determines if an authorization token is valid:

def addRightsControl func token object =
  if ! testToken token object then throw AccessDenied ;
  func object
end

def safeSetBookTitle title token book =
  addRightsControl (setBookTitle title) token book
end 

An interesting addition here is the pattern keyword. First, let’s consider its usage and semantics, and let’s then examine why it makes sense within the language semantics:

pattern UriIdentified decoratedEntity =

   (* To indicate violation of the uniqueness rule *)
  exception UriNotUnique : string

  (* Associate an URI to every instance of the entity. *)
  property uri[decoratedEntity] : string

  (* Access an element by its URI. *)
  def byUri uri =
    decoratedEntity |? $.uri[_] = uri
  end

  (* Overload operator 'uri[object] <- uri' to enforce uniqueness. *)
  def write($.uri) val object =
    if ! empty ($.getByUri val) then throw ($.UriNotUnique val) ;
    $.uri[object] <- val
  end
end

def Page =
module
  (* Define basic properties of a page *)
  entity page
  property content[$.page] : string

  (* Associate pages to Unique Resource Identifiers *)
  usepattern UriIdentified $.page

  (* Provide an URI-to-content shortcut. *)
  def getContentByUri uri =
    $.getByUri uri |: $.content[_]
  end
end

This example creates a pattern : a means of extending the behavior of a module by adding new entities, properties, bindings and other various definitions to that module. The pattern in the example introduces a new exception, a new property (bound to a specified entity that was passed as an argument), a shortcut for accessing the entity (if any) with an identifier and an overloaded version of the write-operator for the property. This pattern can then be used as part of any module using the usepattern statement.

Simple things for simple people

The language now has a variety of keywords with seemingly distinct functionality. This is not actually the case: all the functionality above can be explained using the same rules revolving around chaining through a sequence of mutators.This is because, fundamentally, statements such as entity, property, usepattern, binding, exception or def are just that: mutators.The rules for these mutators are:

The expression… …returns the original module with an additional…
m entity e
entity
m def name = expr end
field with the expression bound to the name
m property p[...] : …
property
m binding b[...]
binding
m exception ex : …
exception
The expression…
…returns…
module
an empty module (with no members).
m pattern name args = defs end
same as m def name args x = x defs end
m usepattern f same as f m

In particular, a pattern is nothing more than a function which is applied to a module and returns a module, and usepattern is just a synonym for the pipe operator | (except that the lambda argument _ is replaced by $ to the right of an usepattern statement). In general, to the right of any of the keywords above save pattern, the dollar symbol represents the module being modified.

So, in essence, adding a function to a module is not any more complex than:

def addSomeMethod mod f =
  mod def method = f end
end 

(addSomeMethod module { x -> x + 1 }).method 10 (* Outputs 11 *)

An additional detail about definitions is that they overwrite previous definitions with the same name (this is also true for built-in operations, such as writing to a property). However, it’s also possible to later remove a definition, which restores the last available definition.This allows defining private elements in modules, but also in patterns which don’t want to overwrite existing definitions with the same name in other modules.

def example =
module
  def frobnicate x = x + 1 end
  def one x = $.frobnicate x end (* Returns x + 1 *)

  def frobnicate x = x / 2 end
  def two x = $.frobnicate x end (* Returns x / 2 *)
  hide frobnicate

  def three x = $.frobnicate x end (* Returns x + 1 *)
end

Other ideas

This is just a basic overview. The next articles in the series will deal with:

  • Using names as first-class variables (manipulating names, concatenating names, generating fresh names locally).
  • Some example patterns that can be easily accomplished with the system.
  • Reflection facilities for exploring a module (in addition to merely adding things to it).
  • A type / predicate system for checking and for reflection.
  • Deciding between runtime metaprogramming and compile-time metaprogramming.


1150 feed subscribers
(readers who polled a feed this week)