<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Nicollet.Net &#187; Dynamic</title>
	<atom:link href="http://www.nicollet.net/chiasma/dynamic/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.nicollet.net</link>
	<description>Everyone Loves Me</description>
	<lastBuildDate>Mon, 23 Jan 2012 16:55:59 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=</generator>
		<item>
		<title>Frameworks, Libraries, Conventions</title>
		<link>http://www.nicollet.net/2012/01/frameworks-libraries-conventions/</link>
		<comments>http://www.nicollet.net/2012/01/frameworks-libraries-conventions/#comments</comments>
		<pubDate>Thu, 05 Jan 2012 19:12:25 +0000</pubDate>
		<dc:creator>Victor Nicollet</dc:creator>
				<category><![CDATA[Dynamic]]></category>
		<category><![CDATA[Architecture]]></category>
		<category><![CDATA[PHP]]></category>
		<category><![CDATA[Zend]]></category>

		<guid isPermaLink="false">http://www.nicollet.net/?p=2649</guid>
		<description><![CDATA[Funkatron came up with the MicroPHP Manifesto : I am a PHP developer I am not a Zend Framework or Symfony or CakePHP developer I think PHP is complicated enough I like building small things I like building small things with simple purposes I like to make things that solve problems I like building small [...]]]></description>
			<content:encoded><![CDATA[<p>Funkatron came up with the <a href="http://funkatron.com/posts/the-microphp-manifesto.html" target="_blank">MicroPHP Manifesto</a> :</p>
<blockquote><p><strong>I am a PHP developer</strong></p>
<ul>
<li>I am not a Zend Framework or Symfony or CakePHP developer</li>
<li>I think PHP is complicated enough</li>
</ul>
<p><strong>I like building small things</strong></p>
<ul>
<li>I like building small things with simple purposes</li>
<li>I like to make things that solve problems</li>
<li>I like building small things that work together to solve larger problems</li>
</ul>
<p><strong>I want less code, not more</strong></p>
<ul>
<li>I want to write less code, not more</li>
<li>I want to manage less code, not more</li>
<li>I want to support less code, not more</li>
<li>I need to justify every piece of code I add to a project</li>
</ul>
<p><strong>I like simple, readable code</strong></p>
<ul>
<li>I want to write code that is easily understood</li>
<li>I want code that is easily verifiable</li>
</ul>
</blockquote>
<p>Unsurprisingly, a large swath of the community did not take it well, for much the same reason it disliked <a href="http://www.nicollet.net/2010/03/why-i-gave-up-on-the-zend-framework/" target="_blank">my earlier piece against Zend Framework</a>: deviation from the commonly accepted norm.</p>
<p>I have come a long way since I wrote that article, and I must have been walking in circles, because I ended up right where I began: why do we call these things <em>frameworks</em>?</p>
<p>Zend, Symfony, CakePHP — as well as Node.js, Rails, Django, Ocsigen &#8230; — actually contribute three different things to projects that use them.</p>
<h4>Libraries</h4>
<p>A library provides <em>functionality</em> used for solving <em>general problems</em> in a flexible, <em>standalone</em> manner. <code>Zend_Mail</code> is a classic example of the library aspect of Zend Framework: you can plug it into your application and start sending e-mail. The interface you would use is uncluttered by details that are not directly related to sending e-mail.</p>
<p>The core qualities of a library are its power (how many different aspects of a problem does it let me solve — attachments, rich text, bouncing, MIME handling&#8230;) and the clarity of its interface. <strong>What problems can you solve, and how fast can you solve them?</strong></p>
<h4>Conventions</h4>
<p>When you hear «conventions» you immediately think of opening brace positions and variable naming rules. It&#8217;s about more than that.</p>
<p>The Model-View-Controller separation is an example of a convention: it has been decided that under no circumstances should HTML rendering occur in Model code, HTTP or session handling in View code, or SQL queries in Controller code.</p>
<p>Good conventions are designed to let the developers assume interesting properties about the code without having to actually read it. A convention like «no global variables» means I never have to care about global state in my code, ever. A convention like «view code must respect the law of Demeter» means all the data used by the view is right where it is being initialized.</p>
<p>They are also designed to make reuse and interoperability easier by reducing the number of ways in which a possible interface can be implemented. A convention could say the values are passed by assigning them to members post-construction and <strong>not</strong> as constructor arguments, so you have one less point of contention between the object that is initialized and the object that does the initialization.</p>
<p>Last but not least, conventions are usually based on experience of things that could go wrong if certain behavior is allowed. A typical example is the requirement to escape all strings as they are being output — eliminating any ambiguities as to whether the string has already been escaped elsewhere and should be output as-is: it has not.</p>
<p>Zend comes with a variety of useful conventions enforced through the interface of its tools: <em>this</em> is how you use a view, <em>this</em> is how you define a view helper that should be available from within any view, <em>this</em> is how you bind a piece of code to a URL, and so on. I happen to disagree with many of those conventions myself, because I believe they solve the wrong problems, but they are certainly better than a project with no conventions.</p>
<p>For reference, my PHP conventions are described in <a href="http://www.nicollet.net/ohm-least-resistance/" target="_blank">the user manual for Ohm</a>.</p>
<h4>Framework</h4>
<p>A framework goes a step further than mere conventions: it is a set of super-conventions designed to be respected by plugin authors. The point is that if plugin A and plugin B respect the set of conventions provided by the framework, then they can be used together in the same application.</p>
<p>Consider a practical example : a plugin that implements a CAPTCHA field in a form, and a plugin that displays and submits a form through AJAX. On a bad day, it goes like this :</p>
<ol>
<li>When an error occurs, the server-side AJAX-form plugin sends out a small piece of JSON containing the fields that have errors, along with the error messages. A small client-side script applies these.</li>
<li>However, the CAPTCHA plugin expected the image to be reloaded when an error occurs.  It may either keep the same image and target word — defeating the purpose of a CAPTCHA — or change the target word without knowing that the image could not be changed.</li>
<li>You then need to post on StackOverflow hoping for a solution, search online for a patch to either plugin that could make it work as expected, or read the code of both plugins in order to write the patch yourself.</li>
</ol>
<p>Had the framework provided a clean notion of « this field must be refreshed on every attempt » as part of its form interface, both plugins would have used it: the CAPTCHA plugin would have marked its field as such, and the AJAX plugin would have implemented a special case for such fields.</p>
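<p>As a rough illustration of such a vocabulary, here is a minimal JavaScript sketch. Every name in it (<code>refreshOnEveryAttempt</code>, <code>captchaField</code>, <code>textField</code>, <code>ajaxErrorResponse</code>) is hypothetical and belongs to no actual framework; the only point is that both plugins rely on the same flag without knowing about each other.</p>

```javascript
// Hypothetical shared vocabulary: every form field is a plain object, and the
// framework defines a `refreshOnEveryAttempt` flag that any plugin may set or read.

// CAPTCHA plugin: marks its field as needing a fresh render on each attempt.
function captchaField(name) {
  return {
    name: name,
    refreshOnEveryAttempt: true,
    // Stand-in for generating a new image; a real plugin would also pick a new target word.
    render: function (attempt) { return 'captcha-image-' + attempt; }
  };
}

// Ordinary text field: nothing to refresh between attempts.
function textField(name) {
  return {
    name: name,
    refreshOnEveryAttempt: false,
    render: function () { return 'input-' + name; }
  };
}

// AJAX form plugin: builds the JSON payload sent back after a failed attempt.
// It knows nothing about CAPTCHAs, only about the shared flag.
function ajaxErrorResponse(fields, errors, attempt) {
  const refresh = {};
  fields.forEach(function (field) {
    if (field.refreshOnEveryAttempt) {
      refresh[field.name] = field.render(attempt);
    }
  });
  return { errors: errors, refresh: refresh };
}

const fields = [textField('email'), captchaField('robot-check')];
const response = ajaxErrorResponse(fields, { email: 'Invalid address' }, 2);
console.log(JSON.stringify(response));
```

<p>The client-side script would then apply the error messages as before, and additionally re-render whatever appears under <code>refresh</code>.</p>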
<p>As such, the purpose of a framework is to provide a clean, unambiguous and extensive <strong>vocabulary</strong> that all the plugins should be able to speak, and that is designed to cover as many real-world situations as possible.</p>
<p>Zend Framework and Symfony in particular do an absolutely great job of this. When you can have a pager component push its data to the page through a progressive enhancement component, and log its performance to FirePHP when a user authentication component determines that the viewing user is a developer, and all of it works by plugging square pegs into square holes, you know there has been a lot of great work going on under the hood.</p>
<h4>Back to the point</h4>
<p>Using a framework is all fun and games until you need to disagree with it. You need to pull out what does not work, and plug your own implementation in its place. The more complex the vocabulary, the harder it is to write new code: frameworks make it easy to connect existing components, at the cost of having to deal with more concepts when actually implementing new things.</p>
<p>What it boils down to, in the end, is whether you expect to be reusing a lot of third party components, or to write a lot of your own code. In the latter case, MicroPHP — and lean environments that do not have a heavy framework side to them — is actually an improvement over trying to fit a six-inch wooden square peg into a mini-USB port.</p>
<p>The exception to this is, of course, being so familiar with a particular framework that you immediately know what changes you need to make without fighting against third party code.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.nicollet.net/2012/01/frameworks-libraries-conventions/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>On Escaping HTML</title>
		<link>http://www.nicollet.net/2011/10/on-escaping-html/</link>
		<comments>http://www.nicollet.net/2011/10/on-escaping-html/#comments</comments>
		<pubDate>Tue, 11 Oct 2011 08:03:53 +0000</pubDate>
		<dc:creator>Victor Nicollet</dc:creator>
				<category><![CDATA[Dynamic]]></category>
		<category><![CDATA[Bugs]]></category>
		<category><![CDATA[HTML]]></category>
		<category><![CDATA[PHP]]></category>

		<guid isPermaLink="false">http://www.nicollet.net/?p=2572</guid>
		<description><![CDATA[A common issue with web software is cross-site scripting attacks — the ability for a third party to inject HTML elements into pages displayed to other users, using scripts contained in those elements to capture user cookies or perform operations on their behalf. The technical challenge in solving this is that whenever data is being [...]]]></description>
			<content:encoded><![CDATA[<p><img class="aligncenter size-full wp-image-2573" title="dome" src="http://www.nicollet.net/wp-content/uploads/2011/10/dome.png" alt="" width="675" height="100" /></p>
<p>A common issue with web software is cross-site scripting attacks — the ability for a third party to inject HTML elements into pages displayed to other users, using scripts contained in those elements to capture user cookies or perform operations on their behalf.</p>
<p>The technical challenge in solving this is that whenever data is output through an HTML page, it should be escaped: any special HTML characters should be turned into their non-special versions in order to be displayed verbatim. This is an ongoing effort: each new page, and each new variable on a page, requires the same work all over again.</p>
<p>Of course, the solution would be to decide that <strong>escaping string output should be a default behavior that must be explicitly overridden</strong>. This does create issues where HTML is escaped when it should not have been, but:</p>
<ul>
<li>These issues cannot be used to perform attacks.</li>
<li>They are usually easier to reproduce and consequently to solve.</li>
<li>HTML <em>usually</em> comes from template files, which can be handled with a different default.</li>
</ul>
<p>Indeed, I can guarantee that my software has zero vulnerabilities related to HTML escaping, because I have built into the type system the fact that HTML always comes from templates, and the method that injects variables into templates escapes them. If I try to use a string as if it were HTML, I get a compiler error.</p>
<p>Even without a type system, one can guarantee that the system would rather break at runtime than allow an injection, using the exact same design, with incompatible data structures for templates and strings that blow up when a string is used as a template:</p>
<pre style="color: #000020; padding-left: 30px;"><code><span style="color: #200080; font-weight: bold;">class</span><span style="color: #000000;"> FilledTemplate </span><span style="color: #406080;">{</span>
<span style="color: #000000;">  </span><span style="color: #200080; font-weight: bold;">function</span><span style="color: #000000;"> </span><span style="color: #400000;">__construct</span><span style="color: #308080;">(</span><span style="color: #007d45;">$html</span><span style="color: #308080;">)</span><span style="color: #000000;"> </span><span style="color: #406080;">{</span>
<span style="color: #000000;">    </span><span style="color: #007d45;">$</span><span style="color: #200080; font-weight: bold;">this</span><span style="color: #308080;">-&gt;</span><span style="color: #007d45;">_html</span><span style="color: #000000;"> </span><span style="color: #308080;">=</span><span style="color: #000000;"> </span><span style="color: #007d45;">$html</span><span style="color: #406080;">;</span>
<span style="color: #000000;">  </span><span style="color: #406080;">}</span>
<span style="color: #000000;">  </span><span style="color: #200080; font-weight: bold;">function</span><span style="color: #000000;"> html</span><span style="color: #308080;">(</span><span style="color: #308080;">)</span><span style="color: #000000;"> </span><span style="color: #406080;">{</span>
<span style="color: #000000;">    </span><span style="color: #200080; font-weight: bold;">return</span><span style="color: #000000;"> </span><span style="color: #007d45;">$</span><span style="color: #200080; font-weight: bold;">this</span><span style="color: #308080;">-&gt;</span><span style="color: #007d45;">_html</span><span style="color: #406080;">;</span>
<span style="color: #000000;">  </span><span style="color: #406080;">}</span>
<span style="color: #406080;">}</span>

<span style="color: #200080; font-weight: bold;">class</span><span style="color: #000000;"> Template </span><span style="color: #406080;">{</span>
<span style="color: #000000;">  </span><span style="color: #200080; font-weight: bold;">function</span><span style="color: #000000;"> </span><span style="color: #400000;">__construct</span><span style="color: #308080;">(</span><span style="color: #007d45;">$file</span><span style="color: #308080;">)</span><span style="color: #000000;"> </span><span style="color: #406080;">{</span>
<span style="color: #000000;">    </span><span style="color: #007d45;">$</span><span style="color: #200080; font-weight: bold;">this</span><span style="color: #308080;">-&gt;</span><span style="color: #007d45;">_template</span><span style="color: #000000;"> </span><span style="color: #308080;">=</span><span style="color: #000000;"> </span><span style="color: #400000;">file_get_contents</span><span style="color: #308080;">(</span><span style="color: #007d45;">$file</span><span style="color: #308080;">)</span><span style="color: #406080;">;</span>
<span style="color: #000000;">  </span><span style="color: #406080;">}</span>
<span style="color: #000000;">  </span><span style="color: #200080; font-weight: bold;">function</span><span style="color: #000000;"> fill</span><span style="color: #308080;">(</span><span style="color: #007d45;">$values</span><span style="color: #308080;">)</span><span style="color: #000000;"> </span><span style="color: #406080;">{</span>
<span style="color: #000000;">    </span><span style="color: #007d45;">$replace</span><span style="color: #000000;"> </span><span style="color: #308080;">=</span><span style="color: #000000;"> </span><span style="color: #200080; font-weight: bold;">array</span><span style="color: #308080;">(</span><span style="color: #308080;">)</span><span style="color: #406080;">;</span>
<span style="color: #000000;">    </span><span style="color: #007d45;">$with</span><span style="color: #000000;">    </span><span style="color: #308080;">=</span><span style="color: #000000;"> </span><span style="color: #200080; font-weight: bold;">array</span><span style="color: #308080;">(</span><span style="color: #308080;">)</span><span style="color: #406080;">;</span>
<span style="color: #000000;">    </span><span style="color: #200080; font-weight: bold;">foreach</span><span style="color: #000000;"> </span><span style="color: #308080;">(</span><span style="color: #007d45;">$values</span><span style="color: #000000;"> </span><span style="color: #200080; font-weight: bold;">as</span><span style="color: #000000;"> </span><span style="color: #007d45;">$key</span><span style="color: #000000;"> </span><span style="color: #308080;">=</span><span style="color: #308080;">&gt;</span><span style="color: #000000;"> </span><span style="color: #007d45;">$value</span><span style="color: #308080;">)</span><span style="color: #000000;"> </span><span style="color: #406080;">{</span>
<span style="color: #000000;">      </span><span style="color: #007d45;">$replace</span><span style="color: #308080;">[</span><span style="color: #308080;">]</span><span style="color: #000000;"> </span><span style="color: #308080;">=</span><span style="color: #000000;"> </span><span style="color: #1060b6;">'{'</span><span style="color: #308080;">.</span><span style="color: #007d45;">$key</span><span style="color: #308080;">.</span><span style="color: #1060b6;">'}'</span><span style="color: #406080;">;</span>
<span style="color: #000000;">      </span><span style="color: #200080; font-weight: bold;">if</span><span style="color: #000000;"> </span><span style="color: #308080;">(</span><span style="color: #007d45;">$value</span><span style="color: #000000;"> </span><span style="color: #200080; font-weight: bold;">instanceof</span><span style="color: #000000;"> FilledTemplate</span><span style="color: #308080;">)</span><span style="color: #000000;"> </span>
<span style="color: #000000;">        </span><span style="color: #007d45;">$with</span><span style="color: #308080;">[</span><span style="color: #308080;">]</span><span style="color: #000000;"> </span><span style="color: #308080;">=</span><span style="color: #000000;"> </span><span style="color: #007d45;">$value</span><span style="color: #308080;">-</span><span style="color: #308080;">&gt;</span><span style="color: #000000;">html</span><span style="color: #308080;">(</span><span style="color: #308080;">)</span><span style="color: #406080;">;</span>
<span style="color: #000000;">      </span><span style="color: #200080; font-weight: bold;">else</span><span style="color: #000000;"> </span>
<span style="color: #000000;">        </span><span style="color: #007d45;">$with</span><span style="color: #308080;">[</span><span style="color: #308080;">]</span><span style="color: #000000;"> </span><span style="color: #308080;">=</span><span style="color: #000000;"> </span><span style="color: #400000;">htmlspecialchars</span><span style="color: #308080;">(</span><span style="color: #007d45;">$value</span><span style="color: #308080;">)</span><span style="color: #406080;">;</span>
<span style="color: #000000;">    </span><span style="color: #406080;">}</span>
<span style="color: #000000;">    </span><span style="color: #200080; font-weight: bold;">return</span><span style="color: #000000;"> </span><span style="color: #200080; font-weight: bold;">new</span><span style="color: #000000;"> FilledTemplate</span><span style="color: #308080;">(</span>
<span style="color: #000000;">      </span><span style="color: #400000;">str_replace</span><span style="color: #308080;">(</span><span style="color: #007d45;">$replace</span><span style="color: #308080;">,</span><span style="color: #007d45;">$with</span><span style="color: #308080;">,</span><span style="color: #007d45;">$</span><span style="color: #200080; font-weight: bold;">this</span><span style="color: #308080;">-&gt;</span><span style="color: #007d45;">_template</span><span style="color: #308080;">)</span>
<span style="color: #000000;">    </span><span style="color: #308080;">)</span><span style="color: #406080;">;</span>
<span style="color: #000000;">  </span><span style="color: #406080;">}</span><span style="color: #000000;"> </span>
<span style="color: #406080;">}</span></code></pre>
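<p>The same design carries over to other dynamic languages. Below is a rough JavaScript port with a usage example; it is a sketch, not a drop-in library. The <code>escapeHtml</code> helper is an assumption standing in for PHP's <code>htmlspecialchars</code>, and the template source is passed in directly instead of being read from a file.</p>

```javascript
// Characters that are special in HTML text and attribute values.
const HTML_ESCAPES = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' };

// Assumed stand-in for PHP's htmlspecialchars().
function escapeHtml(text) {
  return String(text).replace(/[&<>"']/g, function (ch) { return HTML_ESCAPES[ch]; });
}

// HTML known to come from a template: the only thing ever inserted verbatim.
class FilledTemplate {
  constructor(html) { this._html = html; }
  html() { return this._html; }
}

class Template {
  // Takes the template source directly (the PHP version reads it from a file).
  constructor(source) { this._template = source; }

  // Replaces every {key} with its value, escaped unless it is a FilledTemplate.
  fill(values) {
    let html = this._template;
    for (const key of Object.keys(values)) {
      const value = values[key];
      const replacement = value instanceof FilledTemplate
        ? value.html()
        : escapeHtml(value);
      // split/join replaces all occurrences without any regex or `$` pitfalls.
      html = html.split('{' + key + '}').join(replacement);
    }
    return new FilledTemplate(html);
  }
}

// A user-supplied string containing markup, nested inside an outer template.
const comment = new Template('<p>{text}</p>').fill({ text: '<b>hi</b> & bye' });
const page = new Template('<div>{comment}</div>').fill({ comment: comment });
console.log(page.html());
```

<p>The markup in the user-supplied text comes out escaped, while the nested <code>FilledTemplate</code> is inserted verbatim: the only way to obtain unescaped HTML is to go through a template.</p>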
<p>Obviously, many languages and frameworks use non-escaped string output as the default behavior. This, in my opinion, is pure, broken insanity — I can certainly see that designing a safe way of constructing HTML is harder than just following the «HTML is strings, just use string functions» approach and telling the programmer to «always escape your variables, kid» but I still find it quite irresponsible for self-proclaimed Web Languages to rely on such a primitive and dangerous paradigm. The stupid kind of irresponsible. Yes, PHP, I&#8217;m looking at you.</p>
<p><small>Article Image © Freedom II Andres — <a href="http://www.flickr.com/photos/freedomiiphotography/6203083791/">Flickr</a></small></p>
]]></content:encoded>
			<wfw:commentRss>http://www.nicollet.net/2011/10/on-escaping-html/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Node.js is Aquarius</title>
		<link>http://www.nicollet.net/2011/10/node-js-is-aquarius/</link>
		<comments>http://www.nicollet.net/2011/10/node-js-is-aquarius/#comments</comments>
		<pubDate>Mon, 03 Oct 2011 08:07:15 +0000</pubDate>
		<dc:creator>Victor Nicollet</dc:creator>
				<category><![CDATA[Dynamic]]></category>
		<category><![CDATA[Architecture]]></category>
		<category><![CDATA[Node.js]]></category>
		<category><![CDATA[Web]]></category>

		<guid isPermaLink="false">http://www.nicollet.net/?p=2548</guid>
		<description><![CDATA[@Ted Dziuba : your article on Node.js being cancer has brought many angry nerds with pitchforks to your door. You do make good points, and the best opinion is not one that everyone blindly agrees with, but one that gets everyone thinking — hopefully before they speak. Scalability I, too, would take issue with a [...]]]></description>
			<content:encoded><![CDATA[<p><img class="aligncenter size-full wp-image-2549" title="ecluse" src="http://www.nicollet.net/wp-content/uploads/2011/10/ecluse.png" alt="" width="675" height="100" /></p>
<p>@<a href="http://teddziuba.com/2011/10/node-js-is-cancer.html" target="_blank">Ted Dziuba</a> : your article on Node.js being cancer has brought many angry nerds with pitchforks to your door. You do make good points, and the best opinion is not one that everyone blindly agrees with, but one that gets everyone thinking — hopefully before they speak.</p>
<h3>Scalability</h3>
<p>I, too, would take issue with a statement like «Node.js is scalable because it is non-blocking» though not the same issue as you took. Being <em>non-blocking</em> does not help with <em>scalability</em> at all. Scalability is about how easily your system administrator can add a new machine to your web farm to soak up a heavier load than usual, and it&#8217;s all about two things:</p>
<ul>
<li><strong>Can you run multiple copies of your software in parallel</strong>? In-application sharing of data makes this harder. Some Java servers store the entire relevant state in application memory, so scaling is impossible. PHP stores session files on the disk by default, so scaling is only possible with server affinity (the same user always gets sent to the same server). A clean server with no in-application data sharing is easily duplicated, regardless of the language.</li>
<li><strong>Is there a shared resource with sequential access</strong>? If you run a hundred thousand web servers, but all of them have to read-write the same physical drive, then your application will be no faster than that read-write speed. If you access a database that involves heavy locking, then your application will be no faster than the locking sequence can allow.</li>
</ul>
<p>Neither of these is in any way improved or even affected by non-blocking semantics.</p>
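<p>The first of the two points can be pictured with a toy model. In the sketch below, all names (<code>makeStatefulServer</code>, <code>makeStatelessServer</code>, <code>sharedStore</code>) are made up for illustration, and a plain <code>Map</code> merely stands in for an external database or session store.</p>

```javascript
// Variant 1: session data lives in the memory of each server instance.
function makeStatefulServer() {
  const sessions = new Map(); // private to this instance
  return {
    login: function (user) { sessions.set(user, { loggedIn: true }); },
    isLoggedIn: function (user) { return sessions.has(user); }
  };
}

// Variant 2: session data lives in a store shared by all instances.
// The Map passed in merely stands in for an external database or cache.
function makeStatelessServer(store) {
  return {
    login: function (user) { store.set(user, { loggedIn: true }); },
    isLoggedIn: function (user) { return store.has(user); }
  };
}

// Two copies of each server, as if a load balancer alternated between them.
const statefulA = makeStatefulServer();
const statefulB = makeStatefulServer();
statefulA.login('alice');
const seenByStatefulB = statefulB.isLoggedIn('alice'); // false: B never saw the login

const sharedStore = new Map();
const statelessA = makeStatelessServer(sharedStore);
const statelessB = makeStatelessServer(sharedStore);
statelessA.login('alice');
const seenByStatelessB = statelessB.isLoggedIn('alice'); // true: the state is outside the servers

console.log(seenByStatefulB, seenByStatelessB);
```

<p>The second variant can be duplicated freely, but the shared store is now exactly the kind of shared resource the second point warns about.</p>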
<p>Node.js improves <em>performance</em> when serving multiple concurrent requests. It makes it no easier to scale, but it helps delay the point where scaling becomes necessary.</p>
<p>The typical explanation of how this happens is that if serving a request uses 10ms of processing things on the server («Work») and 10ms of waiting for database requests to complete («Wait»), then the ideal web server should be able to serve two concurrent requests at an average cost of 10ms each by overlapping the processing time of one request with the database wait time of another. This is a pretty nice and simple idea, which is why everyone has been doing it for ages. The main difference is how it is done.</p>
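<p>The arithmetic behind that explanation can be sketched as follows, assuming a single CPU core, waits that overlap perfectly, and no scheduling overhead. The function name <code>throughputPerMs</code> is made up for this illustration.</p>

```javascript
// Requests completed per millisecond by `concurrency` workers sharing one core.
// Each request needs `workMs` of CPU time and `waitMs` of waiting on the database.
function throughputPerMs(workMs, waitMs, concurrency) {
  // Limit imposed by each worker handling one request at a time.
  const workerBound = concurrency / (workMs + waitMs);
  // Limit imposed by the single core: at most 1ms of work per elapsed ms.
  const cpuBound = 1 / workMs;
  return Math.min(workerBound, cpuBound);
}

const oneAtATime = throughputPerMs(10, 10, 1);      // one request every 20ms
const overlapped = throughputPerMs(10, 10, 2);      // one request every 10ms
const overProvisioned = throughputPerMs(10, 10, 8); // still one every 10ms: the core is saturated

console.log(oneAtATime, overlapped, overProvisioned);
```

<p>Past the point where the core is saturated, extra concurrency buys nothing, whichever mechanism provides it.</p>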
<p>What the traditional UNIX world did is spawn enough processes (that is the Unix answer to every problem, <a href="http://en.wikipedia.org/wiki/Fork_bomb#Defusing" target="_blank">including having too many processes around</a>). If your Work-time is 10ms and your Wait-time is 40ms, then four other processes can do their work while one is waiting, so by allowing up to five processes you are effectively recycling all the wait-time in a high concurrent load situation. This is why every CGI- or FastCGI-enabled web server in existence provides a configuration entry for the number of concurrent child processes.</p>
<p>Node.js does the same. With that same Wait/Work ratio of 40/10, Node.js will be serving those same five requests concurrently, four waiting while one is processed, because it cannot create processing time out of thin air.</p>
<p>What Node.js brings to the table is an architecture that performs, at the server level, what the traditional UNIX world did at the kernel level: scheduling. Whether this approach is significantly faster than a properly configured FastCGI setup is still a matter of debate, and I believe the answer here is simply that, as long as the Wait/Work time ratio does not push the number of concurrent processes higher than what the available memory allows, there will be no significant difference between FastCGI and Node.js in terms of blocking.</p>
<h4>The UNIX Way</h4>
<p>I once agreed with your stated opinion on the matter, but I got better. Here&#8217;s the thing: today, being an HTTP server is no more of a «responsibility» than reading from STDIN and writing to STDOUT. Make no mistake: being a production, internet-facing HTTP server <em>is</em> a responsibility, but that is not what Node.js is (or should be) trying to achieve.</p>
<p>Consider this: the production, internet-facing HTTP server must communicate with the actual application using one protocol or another. CGI is one such protocol, FastCGI is another, and HTTP is yet another. The fact that the same protocol is used for serving requests over the internet is not a problem; it is actually a benefit, because communicating through HTTP is a solved problem with a clean API in almost every single language out there.</p>
<p>There is now something I would jokingly call «The REST Way» which follows in the tracks of the UNIX Way in a cloudy fashion : small applications performing one task — dispatching internet requests, constructing responses, persistent storage, caching — running on any number of servers in any number of locations, and connected to each other through HTTP requests. In an nginx-Node.js-CouchDB stack, nginx is the dispatcher, Node.js constructs responses, and CouchDB provides persistent storage, and everyone «speaks» HTTP in the same way that Unix processes «speak» STDIN/STDOUT.</p>
<p><small>Article image &copy; Patrick Janicek &mdash; <a href="http://www.flickr.com/photos/marsupilami92/5943144941/">Flickr</a></small></p>
]]></content:encoded>
			<wfw:commentRss>http://www.nicollet.net/2011/10/node-js-is-aquarius/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>jQuery Datepicker &#8211; the Instance Data bug</title>
		<link>http://www.nicollet.net/2011/09/jquery-datepicker-the-instance-data-bug/</link>
		<comments>http://www.nicollet.net/2011/09/jquery-datepicker-the-instance-data-bug/#comments</comments>
		<pubDate>Tue, 06 Sep 2011 10:49:44 +0000</pubDate>
		<dc:creator>Victor Nicollet</dc:creator>
				<category><![CDATA[Dynamic]]></category>
		<category><![CDATA[Bugs]]></category>
		<category><![CDATA[jQuery]]></category>

		<guid isPermaLink="false">http://www.nicollet.net/?p=2538</guid>
		<description><![CDATA[The jQuery UI datepicker does strange things with the DOM, which causes undocumented brittleness. For instance, consider the following operations on a page that contains a single input element: $('input').datepicker().attr('id','the-input'); This will cause no error, and clicking on the input will correctly summon the date picking dialog, but clicking on a date in that dialog [...]]]></description>
			<content:encoded><![CDATA[<p><img class="aligncenter size-full wp-image-2539" title="error" src="http://www.nicollet.net/wp-content/uploads/2011/09/error.png" alt="" width="675" height="100" />The jQuery UI datepicker does <em>strange</em> things with the DOM, which causes undocumented brittleness. For instance, consider the following operations on a page that contains a single input element:</p>
<pre style="padding-left: 30px;">$('input').datepicker().attr('id','the-input');</pre>
<p>This will cause no error, and clicking on the input will correctly summon the date picking dialog, but clicking on a date in that dialog will fail with the following error:</p>
<pre style="padding-left: 30px;">missing instance data for this datepicker</pre>
<p>The diagnosis is quite simple: the jQuery UI datepicker stores additional &#8220;instance data&#8221; based on the id attribute of the element, so changing the id attribute manually causes that instance data to be lost. This unexpected brittleness forced me to spend some time hacking my code so that the identifier is assigned <em>before</em> the datepicker is enabled, but at least this solved the problem.</p>
<p>Two related problems would be:</p>
<ul>
<li>If you have several input elements with the same identifier, and apply the datepicker on the second element, the search-by-id will return the first element and cause the same error as above.</li>
<li>If you apply the <code>hasDatepicker</code> CSS class on an element, and then apply the datepicker plugin, it will assume that the instance data has already been initialized, and will fail.</li>
</ul>
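<p>The mechanism described in the diagnosis can be reproduced with a toy model. This is <em>not</em> jQuery UI's actual code: <code>attachPicker</code>, <code>selectDate</code> and the <code>instances</code> registry are made-up names, and elements are plain objects with an <code>id</code> property.</p>

```javascript
// Toy model of a widget that files its per-instance data under the element's id.
const instances = new Map();

function attachPicker(element) {
  instances.set(element.id, { selectedDate: null });
}

function selectDate(element, date) {
  const data = instances.get(element.id); // looked up by the *current* id
  if (!data) {
    throw new Error('missing instance data for this datepicker');
  }
  data.selectedDate = date;
  return data;
}

// Failing order: attach first, rename afterwards.
const broken = { id: 'generated-1' };
attachPicker(broken);
broken.id = 'the-input';
let failure = null;
try {
  selectDate(broken, '2011-09-06');
} catch (error) {
  failure = error.message;
}

// Working order: rename first, attach afterwards.
const working = { id: 'generated-2' };
working.id = 'the-input';
attachPicker(working);
const picked = selectDate(working, '2011-09-06');

console.log(failure, picked.selectedDate);
```

<p>With the real plugin, the equivalent fix is to reorder the chain so that the identifier is set first: <code>$('input').attr('id','the-input').datepicker();</code></p>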
]]></content:encoded>
			<wfw:commentRss>http://www.nicollet.net/2011/09/jquery-datepicker-the-instance-data-bug/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Coping With Inconsistent Databases</title>
		<link>http://www.nicollet.net/2011/08/coping-with-inconsistent-databases/</link>
		<comments>http://www.nicollet.net/2011/08/coping-with-inconsistent-databases/#comments</comments>
		<pubDate>Fri, 05 Aug 2011 21:21:17 +0000</pubDate>
		<dc:creator>Victor Nicollet</dc:creator>
				<category><![CDATA[Dynamic]]></category>
		<category><![CDATA[Architecture]]></category>
		<category><![CDATA[CouchDB]]></category>

		<guid isPermaLink="false">http://www.nicollet.net/?p=2470</guid>
		<description><![CDATA[In my earlier article about the benefits of NoSQL, I discussed eventually consistent databases. These are databases where « write A ; read A » can return an outdated or missing value, but « write A ; wait ; read A » will always return the correct value if you wait long enough. Dealing with [...]]]></description>
			<content:encoded><![CDATA[<p><img class="aligncenter size-full wp-image-2486" title="clock" src="http://www.nicollet.net/wp-content/uploads/2011/08/clock.png" alt="" width="675" height="100" /></p>
<p>In <a href="http://www.nicollet.net/2011/07/nosql-is-a-premature-optimization/" target="_blank">my earlier article about the benefits of NoSQL</a>, I discussed eventually consistent databases. These are databases where « write A ; read A » can return an outdated or missing value, but « write A ; wait ; read A » will always return the correct value if you wait long enough. Dealing with eventual consistency can lead to bugs, because there are many pitfalls caused by race conditions. It&#8217;s impossible for anyone to avoid race conditions by reading the code and thinking very hard about it. Instead, the code must be written using patterns and <em>mental tools</em> that by their very design prevent race conditions from happening. My point was that most programmers who have only worked in the absolute-consistency SQL world do not have the mental tools necessary to avoid those pitfalls. Not because they are incapable of it, but because they never had the training or the experience to acquire these mental tools.</p>
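<p>The definition of eventual consistency given above can be modelled in a few lines. The sketch below is a toy, not how any particular database works: <code>makeStore</code> and <code>replicate</code> are made-up names, and calling <code>replicate()</code> stands in for « wait long enough ».</p>

```javascript
// Toy eventually consistent store: writes land on a primary copy immediately
// and reach the replica only when replicate() runs.
function makeStore() {
  const primary = new Map();
  const replica = new Map();
  const pending = [];
  return {
    write: function (key, value) {
      primary.set(key, value);
      pending.push({ key: key, value: value });
    },
    // Reads are served by the replica, which may lag behind the primary.
    read: function (key) { return replica.get(key); },
    // Stands in for "wait long enough": the replica catches up.
    replicate: function () {
      while (pending.length > 0) {
        const entry = pending.shift();
        replica.set(entry.key, entry.value);
      }
    }
  };
}

const store = makeStore();
store.write('A', 1);
const immediateRead = store.read('A'); // undefined: the write has not propagated yet
store.replicate();
const laterRead = store.read('A');     // 1: the replica has caught up

console.log(immediateRead, laterRead);
```

<p>Any code that makes a decision based on <code>immediateRead</code> is working from data that is already out of date.</p>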
<p>Today, an anonymous coward shared a few thoughts on the topic :</p>
<blockquote><p>They do not have the mental tools required to work with eventual consistency?<br />
The only mental tool I’ve seen is disregard for the issue.<br />
Waiting eagerly on another post discussing those “mental tools”.</p></blockquote>
<p>He/she is right, what <em>are</em> those mental tools anyway ?</p>
<p>First, let me state the obvious again : eventually consistent databases almost never remain inconsistent long enough for users to notice and, even if they do notice, they usually don&#8217;t care — through the prevalence of cache-powered websites, our users are used to seeing stale data every so often and know to hit the refresh button to deal with it. Aside from a few critical edge cases like online payment processing, <strong>the problem with eventual consistency is not the user</strong>.</p>
<p>The problem is that software makes decisions based on available data and, if the available data is wrong, then the outcome is wrong. This decision-making process will turn a one-nanosecond inconsistency into a permanent error if you are unlucky, and the entire point of this article is how to prevent this from happening. Need an example?</p>
<h3>Event-Based vs State-Based</h3>
<p>Let&#8217;s say I&#8217;m writing a badge module similar to the one used on <a href="http://stackoverflow.com/badges" target="_blank">Stack Overflow</a>. Here are the specifications:</p>
<blockquote><p>The user can publish articles. Their 10th article will bear a bronze badge, their 50th will bear a silver badge, and their 100th article will bear a golden badge.</p></blockquote>
<p>One way I can write this module is to intercept the «publish article» event and add my own bit of logic to it: if there are nine other articles, award the bronze badge. This is an event-based approach, because it performs some changes when an event happens. This way of doing things is almost universally followed in the SQL world, but it does not work in NoSQL environments that lack absolute consistency.</p>
<p><strong>What&#8217;s the problem?</strong> One user, Bob, tries to cheat the system by publishing nine articles, then publishing articles X and Y in quick succession, hoping to get bronze badges for both. The behavior we want is that X should have the bronze badge and Y should not.</p>
<ul>
<li>If absolute consistency is guaranteed, then Y will be published when the database already knows that X has been published: it will be the 11th article and thus will not receive the badge.</li>
<li>If only eventual consistency is guaranteed, then Y might be published before the existence of X has been acknowledged : both articles would receive badges.</li>
</ul>
<p>The alternative is to use a state-based architecture where «On EVENT apply CHANGE» is replaced by «If STATE-A then STATE-B» : instead of «On publishing the tenth article, award badge» the system uses «If this is the tenth article, then it has the bronze badge.» Where an event-based solution would apply the CHANGE and move on, the state-based solution instead examines STATE-A whenever someone asks for STATE-B and applies the rule every single time.</p>
<p>Going back to Bob&#8217;s problem : if you ask a few nanoseconds after both articles are published «Does article Y have the bronze badge?» then the answer will still be «Yes» because eventual consistency takes a short while to set in. But if you ask the same question a few seconds later, then article Y will be correctly known as being the 11th article and the answer will be «No».</p>
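<p>A minimal sketch of that state-based rule, assuming a hypothetical in-memory list that stands in for whatever a database replica can currently see (the names <code>badge_for</code>, <code>stale_view</code> and <code>converged_view</code> are illustrative, not part of any real API):</p>

```python
# Hypothetical model: each replica sees a (possibly stale) list of Bob's
# published article ids, ordered by publication time.
BADGES = {10: "bronze", 50: "silver", 100: "gold"}

def badge_for(article_id, visible_articles):
    """State-based rule: derive the badge from the article's current rank."""
    if article_id not in visible_articles:
        return None
    rank = visible_articles.index(article_id) + 1
    return BADGES.get(rank)

nine = ["a%d" % i for i in range(1, 10)]

# Inconsistency window: the replica answering for Y has not seen X yet.
stale_view = nine + ["Y"]
# After the database converges, both X and Y are visible.
converged_view = nine + ["X", "Y"]

print(badge_for("Y", stale_view))       # temporarily wrong answer
print(badge_for("Y", converged_view))   # corrected once X is visible
print(badge_for("X", converged_view))
```

<p>Nothing is ever written as a result of the stale read, so the wrong answer disappears as soon as the underlying data converges.</p>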
<p>An application that is entirely based on state-based rules can work with an eventually consistent database without ever having permanent errors: by definition, any errors would only last as long as the underlying inconsistencies remain. In practice, from my experience with CouchDB, all temporary errors are gone after a couple of seconds in the very worst case, and they are usually gone before that.</p>
<p>But state-based rules do mean that whenever the application needs to know STATE-B, it must read STATE-A and apply the rule again. Does this mean that I will have to count the articles (a potentially costly operation) whenever I need to know if a given article has the bronze badge? This is pure insanity!</p>
<h3>State-Based Caches</h3>
<p>The NoSQL answer is «Cache it!»</p>
<p>In fact, I will go even further: a NoSQL-friendly architecture eliminates several downsides of caching while keeping all the performance benefits, in ways that no event-based SQL solution can.</p>
<ul>
<li>Staleness of cached data is not an issue: the software is already designed to deal with eventual consistency and a cache is just another kind of eventually consistent data source. Unlike traditional software that relies on absolute consistency, NoSQL-friendly applications can make business decisions based on cached data without any risk.</li>
<li>Dependencies between STATE-A and STATE-B are usually first-class citizens of the application source code, so when a state change happens it&#8217;s easy to follow the threads and invalidate all the dependencies. The application can rely on invalidation instead of timeouts to keep the cache up-to-date.</li>
<li>Most NoSQL solutions already provide some level of caching. For instance, counting the number of published articles in CouchDB is <a href="http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views#Reduced_Value_Sizes" target="_blank">a constant-time cached operation</a>, and the database keeps the cache up-to-date without developer intervention. In fact, manual caching is almost never a requirement for simple rules in CouchDB — and even then, the database provides a &#8220;last changes&#8221; real-time feed that the developer can use to make cache management easier.</li>
</ul>
<p>It is interesting to note that several common patterns in SQL event-based applications are in fact poor implementations of a caching strategy for a state-based rule. An upvote/downvote system such as the one <a href="http://www.reddit.com/" target="_blank">Reddit</a> uses involves storing both the score of each item in the <em>item</em> table, and the individual votes in a <em>user-item</em> association table: the former is used to quickly determine the current score of an item, while the latter is used to prevent people from voting several times. The state-based query implemented here is :</p>
<p style="padding-left: 30px;"><code>SELECT SUM(score) FROM votes WHERE item_id = ?</code></p>
<p>However, the naive event-based solution is to intercept &#8220;upvote&#8221; and &#8220;downvote&#8221; events and perform this query instead:</p>
<p style="padding-left: 30px;"><code>UPDATE item SET score = score + 1 WHERE item_id = ?</code></p>
<p>This is done in the hope that the sequence of +1s and -1s will remain equivalent to the original state-based query, which is only the case if upvotes and downvotes are the only events that affect the votes table. If, say, banning a user account retroactively deletes all the associated votes, it would take another ad hoc query to keep the cache correct. Maybe something like this:</p>
<p style="padding-left: 30px;"><code>UPDATE item JOIN votes ON votes.item_id = item.item_id<br />
SET item.score = item.score - votes.score<br />
WHERE votes.user_id = ?</code></p>
<p>This is because of a fundamental difference between event-based and state-based designs : if your value actually depends on the state, then it takes one state-based piece of code to compute it, but it takes one event-based piece of code<em> for each possible event that could ever affect it</em>.</p>
<p>And even then, you still have to write the state-based update code because you will need to run it to rebuild the cache whenever something goes wrong.</p>
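<p>To make the drift concrete, here is a small sketch using Python&#8217;s built-in <code>sqlite3</code> module. The table layout mirrors the queries above, but the helper names (<code>vote</code>, <code>rebuild_score</code>, <code>cached_score</code>) and the sample data are assumptions made for illustration:</p>

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE item (item_id INTEGER PRIMARY KEY, score INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE votes (item_id INTEGER, user_id INTEGER, score INTEGER,
                        PRIMARY KEY (item_id, user_id));
    INSERT INTO item (item_id) VALUES (1);
""")

def vote(item_id, user_id, score):
    """Event-based path: record the vote and nudge the cached score."""
    db.execute("INSERT INTO votes VALUES (?, ?, ?)", (item_id, user_id, score))
    db.execute("UPDATE item SET score = score + ? WHERE item_id = ?", (score, item_id))

def rebuild_score(item_id):
    """State-based path: recompute the cached score from the votes table."""
    db.execute(
        "UPDATE item SET score = "
        "(SELECT COALESCE(SUM(score), 0) FROM votes WHERE votes.item_id = item.item_id) "
        "WHERE item_id = ?", (item_id,))

def cached_score(item_id):
    row = db.execute("SELECT score FROM item WHERE item_id = ?", (item_id,)).fetchone()
    return row[0]

vote(1, 100, +1)
vote(1, 101, +1)
vote(1, 102, -1)

# An event nobody wrote a handler for: a banned user's votes are deleted.
db.execute("DELETE FROM votes WHERE user_id = ?", (101,))
stale = cached_score(1)    # still includes the deleted vote
rebuild_score(1)
fresh = cached_score(1)    # agrees with SUM(score) again
print(stale, fresh)
```

<p>The single <code>rebuild_score</code> function repairs the cache no matter which event caused the drift, which is why it has to exist even in an event-based design.</p>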
<h3>Typical State-Based Architecture</h3>
<p>There are three kinds of rules in any application :</p>
<ul>
<li>State-based rules : when this value is X, that value is F(X). Most <em>indirect</em> consequences of user input are here.</li>
<li>Event-based input rules : when this event happens in the real world, do X. This could be caused by user input, or when communicating with a third party API.</li>
<li>Event-based output rules : when this happens in the application, perform X in the real world. The classic example is sending an e-mail, but this covers <em>pushing</em> any kind of data to anyone outside your application.</li>
</ul>
<p><strong>State-based rules</strong> can be handled natively.</p>
<p><strong>Input rules</strong> are usually handled by performing an <em>atomic, non-conflicting</em> write to the database whenever the event happens, in such a way that no conflict can happen after the event has passed. One solution is to simply create a new document with a unique identifier every time an event happens: unique identifiers prevent conflicts, and you can then rely on state-based rules to aggregate a sequence of events into a more coherent current state. In my current project, every notification received from PayPal is appended to a database, and a state-based rule aggregates those notifications into a pending-failed-successful state for every transaction. As an added bonus this solution also provides a history (the list of related events) and the possibility to <em>cancel</em> events by deleting the corresponding document in the same way that one can revert a Wikipedia article to a previous version by removing the corresponding diffs.</p>
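<p>As a rough sketch of this append-then-aggregate pattern, assume a hypothetical in-memory store where each notification is its own document. The field names and the &#8220;latest notification wins&#8221; aggregation are illustrative guesses, not the actual rule used in the project:</p>

```python
import uuid

# Hypothetical notification store: one document per event, keyed by a unique
# id, so two writers can never conflict on the same document.
notifications = {}

def record_notification(transaction_id, status, received_at):
    doc_id = uuid.uuid4().hex
    notifications[doc_id] = {
        "transaction": transaction_id,
        "status": status,            # assumed values: pending, failed, successful
        "received_at": received_at,
    }
    return doc_id

def transaction_state(transaction_id):
    """State-based rule: the most recent notification decides the state."""
    related = [n for n in notifications.values()
               if n["transaction"] == transaction_id]
    if not related:
        return None
    latest = max(related, key=lambda n: n["received_at"])
    return latest["status"]

record_notification("T1", "pending", 1)
late = record_notification("T1", "failed", 3)
record_notification("T1", "successful", 2)

before_cancel = transaction_state("T1")
del notifications[late]      # cancelling an event means deleting its document
after_cancel = transaction_state("T1")
print(before_cancel, after_cancel)
```

<p>Because the state is recomputed from the surviving documents, deleting one of them is all it takes to cancel the event.</p>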
<p>Another solution for handling input rules is useful when the user <em>sets</em> a value — what matters to the user is the resulting value, not the operation that resulted in that value. If setting this value can be done by an <em>atomic, non-conflicting</em> update, then do so. Keep in mind that if you use CouchDB master-master replication, then updates are <em>not</em> non-conflicting !</p>
<p><strong>Output rules</strong> are trickier. If you are lucky, your output rule is in fact tied to an input event such as «When you click this button, I will ask PayPal for your money» and this can in fact be handled as a normal input rule that just happens to query a third party API for more input data.</p>
<p>Application-initiated output events involve creating an entry that represents the outgoing event before it happens, with a timestamp of the moment the event should happen, appropriately set some time into the future. That entry is then managed by standard state-based rules that can alter it or disable it as part of the corresponding source data eventually becoming consistent. The delay should be calculated to ensure that the database does become consistent, and a delay of a few minutes is not a problem because the action was not initiated by the user. Once the delay expires, the application reads back the entry and performs the output action if it is still appropriate.</p>
<p>Back to Bob&#8217;s articles : let&#8217;s say the specifications require that I send Bob a congratulatory e-mail whenever an article gets a badge. Because he cheated, the state-based rule determines mistakenly that Bob&#8217;s articles X and Y both received a bronze badge, so it creates two entries in the «congratulatory e-mail» section, both set one minute into the future.</p>
<p>The trick here is that the identifiers of those entries are something along the lines of &#8220;Bronze-Badge-Y&#8221; so that applying the state-based rule several times merely updates the same entry instead of creating a new one every time. After a few seconds, the eventual consistency catches up with Bob and article Y loses its bronze badge status. The rule-based system detects that the &#8220;Bronze-Badge-Y&#8221; entry needs to be updated and marks it as «do not send».</p>
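<p>The delayed-output idea can be sketched as follows. The <code>outbox</code> dictionary, the 60-second delay and the function names are assumptions chosen for the example; only the deterministic &#8220;Bronze-Badge-Y&#8221; identifier comes from the description above:</p>

```python
# Hypothetical outbox keyed by a deterministic identifier, so that re-applying
# the rule updates the existing entry instead of creating a duplicate.
outbox = {}
DELAY_SECONDS = 60    # assumed to be longer than the inconsistency window

def apply_badge_rule(article_id, has_bronze_badge, now):
    entry_id = "Bronze-Badge-%s" % article_id
    entry = outbox.get(entry_id)
    if entry is None:
        if has_bronze_badge:
            outbox[entry_id] = {"send_at": now + DELAY_SECONDS,
                                "status": "scheduled"}
    elif entry["status"] != "sent":
        # Keep the original send_at; only flip between the two states.
        entry["status"] = "scheduled" if has_bronze_badge else "do not send"

def due_emails(now):
    """Mark and return entries whose delay expired and that are still valid."""
    due = []
    for entry_id, entry in sorted(outbox.items()):
        if entry["status"] == "scheduled" and now >= entry["send_at"]:
            entry["status"] = "sent"
            due.append(entry_id)
    return due

apply_badge_rule("X", True, now=0)
apply_badge_rule("Y", True, now=0)    # stale read: Y looks like the 10th article
apply_badge_rule("Y", True, now=1)    # rule re-applied: still one entry for Y
apply_badge_rule("Y", False, now=5)   # consistency caught up: Y is the 11th
sent = due_emails(now=60)
print(sent)
```

<p>Only the entry for X survives until its send time, and running the sender a second time finds nothing left to send.</p>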
<h3>User Uncertainty</h3>
<p>Earlier, I skimmed over the fact that users don&#8217;t care about eventual consistency. There&#8217;s one exception to this rule — when you&#8217;re asking users to make a decision based on data you are showing them, you cannot afford to go wrong.</p>
<p>If you ask your user whether they wish to pay $100, and you bill them $101 instead because the price changed in the database while the user was reading the confirmation form, then you have a problem.</p>
<p>This problem, however, is not specific to the NoSQL eventual consistency world. In fact, the average SQL application has the same problem: it&#8217;s impossible to start a transaction, show the user a confirmation form, and only end the transaction when the user confirms. Transactions do not work that way. Instead, both SQL and NoSQL solutions must resort to a conflict detection strategy: when the user confirms, check whether the user&#8217;s decision is still compatible with the application state and if it isn&#8217;t, show them an error message — «Sorry, the price just went up to $101, do you still want to go on?»</p>
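<p>A bare-bones illustration of that check at confirmation time, assuming a hypothetical <code>current_price</code> lookup and a form that sends back the price the user actually saw:</p>

```python
# Hypothetical price store; a real system would read this from the database.
current_price = {"plan-basic": 100}

def confirm_purchase(item_id, price_shown):
    """Accept only if the price the user agreed to is still the current one."""
    actual = current_price[item_id]
    if actual != price_shown:
        return {"ok": False,
                "message": "Sorry, the price is now $%d, "
                           "do you still want to go on?" % actual}
    return {"ok": True, "charged": actual}

first = confirm_purchase("plan-basic", price_shown=100)
current_price["plan-basic"] = 101     # price changes while the form is open
second = confirm_purchase("plan-basic", price_shown=100)
print(first)
print(second)
```

<p>The same check works whether the price comes from an SQL row or from an eventually consistent document; what differs is how fresh the value being compared can be.</p>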
<p>It is possible to detect conflict using state-based rules in an eventually consistent database: entry A, created when the user confirmed the payment, states that $100 should be billed, but entry B created a few seconds before entry A states that the price is now $101. The problem is that it might take a short while for entries A and B to be processed together, but we need to show a confirmation page straight away&#8230;</p>
<p>You have two possibilities here. The first is the most obvious one: have the user wait until the eventual consistency kicks in and you can genuinely confirm their purchase; you may optimise your NoSQL usage to make that delay shorter, such as by avoiding master-master replication on that particular database.</p>
<p>The second possibility, for which I have a personal preference, is to provide an answer straight away, but reserve the right to deny that decision later. This means that in 99% of the cases, there is no conflict and the user does not have to wait. In 99% of the remaining cases, the user waited long enough on the confirmation page that the conflict is detected straight away. It really takes a stroke of bad luck for the user&#8217;s decision to happen precisely as the situation changes, so having to cancel in those specific cases is acceptable, especially since your state-based architecture can handle the cancellation quite well.</p>
<p>This is no different than having to cancel an e-commerce order because the ordered item was lost at the warehouse — the computer said yes, but reality said no.</p>
<h3>TL ; DR</h3>
<ol>
<li>An UPDATE is <em>permanently</em> inconsistent if it was based on <em>temporarily</em> inconsistent data.</li>
<li>The result of a CREATE is never <em>permanently</em> inconsistent.<br />
So, don&#8217;t UPDATE objects, CREATE object <em>modifications</em>.</li>
<li>To get the latest version of an object, apply a map-reduce algorithm to the modifications.</li>
<li>You should cache data, but the cache must be recalculated whenever the underlying data changes.</li>
<li>Some UPDATEs are in fact hidden cache refreshes. Use a normal cache instead.</li>
<li>When affecting the outside world, wait for the eventual consistency to kick in before you act.</li>
<li>Conflicts can affect users, but only rarely. Plan your UI accordingly.</li>
</ol>
<p><small>Article Image &copy; Chris Dlugosz &mdash; <a href="http://www.flickr.com/photos/chrisdlugosz/4324706280/">Flickr</a></small></p>
]]></content:encoded>
			<wfw:commentRss>http://www.nicollet.net/2011/08/coping-with-inconsistent-databases/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Detailed Usage Statistics</title>
		<link>http://www.nicollet.net/2011/07/detailed-usage-statistics/</link>
		<comments>http://www.nicollet.net/2011/07/detailed-usage-statistics/#comments</comments>
		<pubDate>Mon, 25 Jul 2011 00:23:59 +0000</pubDate>
		<dc:creator>Victor Nicollet</dc:creator>
				<category><![CDATA[Dynamic]]></category>
		<category><![CDATA[CouchDB]]></category>
		<category><![CDATA[Optimization]]></category>
		<category><![CDATA[Profiling]]></category>
		<category><![CDATA[RunOrg]]></category>

		<guid isPermaLink="false">http://www.nicollet.net/?p=2445</guid>
		<description><![CDATA[User experience is heavily influenced by how fast the application or web site reacts. It makes sense to set up tools that help the development team detect performance issues and correct them. I&#8217;ve shamelessly pilfered the basic ideas illustrated by Jeff Atwood last month : That&#8217;s why, as a developer, you need to put performance [...]]]></description>
			<content:encoded><![CDATA[<p><img class="aligncenter size-full wp-image-2448" title="stats-header" src="http://www.nicollet.net/wp-content/uploads/2011/07/stats-header.png" alt="" width="675" height="100" /></p>
<p>User experience is heavily influenced by how fast the application or web site reacts. It makes sense to set up tools that help the development team detect performance issues and correct them. I&#8217;ve shamelessly pilfered the <a href="http://www.codinghorror.com/blog/2011/06/performance-is-a-feature.html" target="_blank">basic ideas illustrated by Jeff Atwood last month</a> :</p>
<blockquote><p>That&#8217;s why, as a developer, you need to put performance right in front  of your face on every single page, all the time. That&#8217;s exactly what we  did with our <a href="http://code.google.com/p/mvc-mini-profiler/">MVC Mini Profiler</a>, which we are contributing back to the world as open source. The simple act of <strong>putting a render time in the upper right hand corner of every page we serve</strong> forced us to fix all our performance regressions and omissions.</p></blockquote>
<p>When a member of the team is logged in, every request sent to the server, including all AJAX requests, displays a small summary of what happened during that request:</p>
<p><img class="aligncenter size-full wp-image-2446" title="after-optim" src="http://www.nicollet.net/wp-content/uploads/2011/07/after-optim.png" alt="" width="650" height="197" /></p>
<p>In addition to these real-time stats, the server also saves this profiling data to the database (at a cost of an additional millisecond or two). That database is then sliced up into atomic operations like &#8220;Get Picture&#8221; or &#8220;Get Item&#8221;, averaged, and consolidated into charts:</p>
<p><img class="aligncenter size-full wp-image-2447" title="graphs" src="http://www.nicollet.net/wp-content/uploads/2011/07/graphs.png" alt="" width="544" height="782" /></p>
<p>The operations are sorted by Impact = Duration × Frequency, which happens to be an accurate approximation of how much performance could be gained by optimizing a given operation. The database load is not a duration: it summarizes how many requests were performed and how many bytes were sent or received, but it does not reflect how long it took the database to actually process the request.</p>
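<p>The ranking itself is simple arithmetic. Here is a sketch with made-up sample durations (the numbers and the &#8220;Identify User&#8221; label are invented for the example and are not taken from the charts above):</p>

```python
from collections import defaultdict

# Hypothetical profiling samples: (operation name, duration in milliseconds).
samples = [
    ("Get Picture", 2), ("Get Picture", 2), ("Get Picture", 2), ("Get Picture", 2),
    ("Get Item", 5),
    ("Identify User", 3), ("Identify User", 3),
]

def rank_by_impact(samples):
    totals = defaultdict(lambda: [0, 0.0])     # name -> [count, total duration]
    for name, duration_ms in samples:
        totals[name][0] += 1
        totals[name][1] += duration_ms
    impact = {}
    for name, (count, total_ms) in totals.items():
        average_ms = total_ms / count
        # Impact = average duration x frequency, i.e. total time spent.
        impact[name] = average_ms * count
    return sorted(impact.items(), key=lambda pair: pair[1], reverse=True)

ranking = rank_by_impact(samples)
for name, value in ranking:
    print(name, value)
```

<p>In this invented data set the fastest operation still comes out on top, simply because it runs far more often than the others.</p>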
<p>This chart tells me that, among all atomic operations, accessing a picture is the most important one — which is quite expected, because every page contains several user pictures: even though the operation is very fast, it&#8217;s executed extremely often. So, it would be possible to slash request times across the board by either de-normalizing the picture URL into the user profile (so that both the name and picture URL are loaded from the database in one read) or by storing the picture-identifier-to-URL mapping in a distributed cache such as memcache.</p>
<p>The second most important operation would be user identification: determining the relationship between a user and the workspace they are trying to access, which happens on every single HTTP request. Whereas grabbing a picture is a simple fetch-by-identifier from CouchDB, user identification involves accessing a view with <code>include_docs=true</code>, which is slower. The optimization I am considering is to provide the user with a cookie that contains the identifier of their relationship to a given workspace, which is easy because every workspace has its own sub-domain, and run a fetch-by-identifier from CouchDB instead of a view access (of course, if the cookie is incorrect or missing, the server will fall back on a view access). From my experience with similar situations, this would cut the execution time in half.</p>
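<p>The cookie-with-fallback lookup could look roughly like this. The dictionaries below are hypothetical stand-ins for CouchDB documents and for the view; they do not reflect the application&#8217;s real schema:</p>

```python
# Hypothetical stand-ins for CouchDB: documents by id, and a view keyed by
# (user, workspace) that maps to the relationship document id.
docs = {"rel-42": {"user": "bob", "workspace": "acme", "role": "member"}}
view = {("bob", "acme"): "rel-42"}
stats = {"view_queries": 0}

def identify(user, workspace, cookie_rel_id=None):
    # Fast path: trust the cookie only if the document really matches.
    doc = docs.get(cookie_rel_id)
    if doc and doc["user"] == user and doc["workspace"] == workspace:
        return cookie_rel_id, doc
    # Fallback: the slower view access, when the cookie is missing or wrong.
    stats["view_queries"] += 1
    rel_id = view.get((user, workspace))
    return rel_id, docs.get(rel_id)

with_cookie = identify("bob", "acme", cookie_rel_id="rel-42")
views_after_cookie = stats["view_queries"]
with_bad_cookie = identify("bob", "acme", cookie_rel_id="bogus")
views_after_fallback = stats["view_queries"]
print(views_after_cookie, views_after_fallback)
```

<p>Checking that the fetched document matches the user and workspace keeps a forged or outdated cookie from granting access; it merely costs one extra view query.</p>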
]]></content:encoded>
			<wfw:commentRss>http://www.nicollet.net/2011/07/detailed-usage-statistics/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>NoSQL Is A Premature Optimization</title>
		<link>http://www.nicollet.net/2011/07/nosql-is-a-premature-optimization/</link>
		<comments>http://www.nicollet.net/2011/07/nosql-is-a-premature-optimization/#comments</comments>
		<pubDate>Sat, 23 Jul 2011 12:32:15 +0000</pubDate>
		<dc:creator>Victor Nicollet</dc:creator>
				<category><![CDATA[Dynamic]]></category>
		<category><![CDATA[Architecture]]></category>
		<category><![CDATA[CouchDB]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Productivity]]></category>

		<guid isPermaLink="false">http://www.nicollet.net/?p=2441</guid>
		<description><![CDATA[Or so Bob Warfield writes. I happen to agree with the title — optimization using NoSQL means using a server cluster to split the load and scale up, and such an optimization is premature unless you are already having the millions of visits it takes to feel growing pains. If I start off on a [...]]]></description>
			<content:encoded><![CDATA[<p><img class="aligncenter size-full wp-image-2442" title="bulb" src="http://www.nicollet.net/wp-content/uploads/2011/07/bulb.png" alt="" width="675" height="100" /></p>
<p><a href="http://smoothspan.wordpress.com/2011/07/22/nosql-is-a-premature-optimization/" target="_blank">Or so Bob Warfield writes</a>. I happen to agree with the title: optimization using NoSQL means using a server cluster to split the load and scale up, and such an optimization is premature unless you are already having the millions of visits it takes to feel growing pains. If I start off on a new project and decide «<em>I&#8217;m going to use NoSQL so that it will scale when my project has millions of users</em>» then I am prematurely assuming that my initial NoSQL strategy will fit the actual million-user scenario that will come up years from now. In fact, the bottleneck will probably be in a feature I didn&#8217;t even think of yet, and making it work will probably involve changes in the persistence model. But Bob Warfield goes further than the premature optimization argument:</p>
<blockquote><p><strong>Point 2:  There is no particular advantage to NoSQL until you  reach scales that require it.  In fact it is the opposite, given Point  1.</strong></p>
<p>It’s harder to use.  You wind up having to do more in your  application layer to make up for what Relational does that NoSQL can’t  that you may rely on.  Take consistency, for example.  As Anand says in  his video, “Non-relational systems are not consistent.  Some, like  Cassandra, will heal the data.  Some will not.  If yours doesn’t, you  will spend a lot of time writing consistency checkers to deal with it.”   This is just one of many issues involved with being productive with  NoSQL.</p></blockquote>
<p>My current SaaS project pivoted from MySQL to CouchDB nearly at the beginning, certainly before we had any customers or any features worth showing. My greatest fear when settling on CouchDB was that I would have to <em>work around</em> the NoSQL lack of transactions, joins, consistency or whatever else you expect from a database system.</p>
<p>I was sorely mistaken, and so is Bob Warfield.</p>
<p>Even though NoSQL fails to solve many of the <em>low-level</em> problems that SQL eats for breakfast, this does not make it incapable of solving the same <em>high-level</em> problems as traditional relational strategies. You just need to understand how to do it, in the same way that you had to understand relational algebra, joins, indexes and transactions before doing anything worthwhile with SQL. Coming to NoSQL and expecting to solve your problems with those same strategies that you used in the relational world is as silly as using a hammer to drive screws in.</p>
<p>For instance, CouchDB has no global consistency, only eventual consistency — there is an <em>inconsistency window</em> where state spread across multiple documents can be inconsistent. This will make any relational programmer scream bloody murder. And yes, if you absolutely and positively need to have that state stay consistent, then you will need some application-side code to do it, and it will ruin your productivity.</p>
<p>But most applications don&#8217;t <em>need</em> global consistency: in fact, an inconsistency window of a few seconds is acceptable in most situations. It is the programmers who need global consistency, because they do not have the mental tools required to work with eventual consistency. But once you get the hang of it, there is no working around, no overhead, no additional steps or checks required to make your application work. It is a different route, but not a longer one.</p>
<p>In addition to the above, from my experience, <strong>there are clear and significant benefits to using CouchDB over MySQL that are not related to scalability or performance</strong>. These benefits may well be useless to your specific situation, but they do exist.</p>
<h3>1. Schema changes are painless and non-locking</h3>
<p><a href="http://highscalability.com/blog/2010/5/17/7-lessons-learned-while-building-reddit-to-270-million-page.html" target="_blank">This (lesson 3)</a> is what brought me to NoSQL in the first place.</p>
<p>CouchDB does not implement a schema in the way an SQL product rigidly delineates tables, columns and relationships. Of course, it would be foolish to actually have no schema concept at all, so there is a dedicated schema layer in our application architecture that describes what the CouchDB &#8220;tables&#8221; look like, in terms of serialization and deserialization. Schema changes are therefore a simple change to the deserialization process, which needs to be able to read the old data format.</p>
<p>For simple changes, such as adding a field with a constant value, no work is required as the deserialization layer can fill in the missing field on the fly. For complex changes that involve application-provided data, such as adding a &#8220;file size&#8221; field that needs to be initialized with the actual file size, there is a clear benefit to having the application itself perform the schema change, as opposed to application-independent ALTER scripts.</p>
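<p>A sketch of what such a deserialization layer might do, assuming a hypothetical &#8220;file&#8221; document whose older version lacks two fields (the field names and the <code>measure_size</code> callback are invented for the example):</p>

```python
# Hypothetical reader for "file" documents. Old documents lack the two newer
# fields; the reader upgrades them on the fly instead of running ALTER scripts.
def read_file_doc(raw, measure_size):
    doc = dict(raw)
    # Simple change: a new field with a constant default value.
    doc.setdefault("visibility", "private")
    # Complex change: the value has to come from the application itself.
    if "size" not in doc:
        doc["size"] = measure_size(doc["name"])
    return doc

sizes_on_disk = {"report.pdf": 2048}     # stand-in for asking the file store
old_doc = {"name": "report.pdf"}
new_doc = {"name": "logo.png", "visibility": "public", "size": 512}

upgraded = read_file_doc(old_doc, sizes_on_disk.get)
untouched = read_file_doc(new_doc, sizes_on_disk.get)
print(upgraded)
print(untouched)
```

<p>Documents already in the new format pass through unchanged, so old and new data can coexist in the same database while the migration happens lazily.</p>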
<h3>2. Document contents can be dynamic</h3>
<p>This was the actual reason we settled on CouchDB: our application lets users add their own custom fields to objects, and then filter/sort based on these fields. This requires almost no programming effort (aside, of course, from the user interface involved in doing so) and is nearly as efficient as using static programmer-provided fields.</p>
<p>I have had in the past some experience with managing arbitrary fields on a SQL platform, mostly when I was working with open source e-commerce platform Magento. Dynamic fields involve some significant boilerplate (such as entity-attribute-value tables) and clever tricks to perform filtering efficiently.</p>
<h3>3. The application-database impedance is lower</h3>
<p>A typical SQL schema contains two kinds of relationships: natural relationships such as «<em>an article has an author</em>» between two entities that can and will usually be queried independently, and accidental relationships such as «<em>an article has several tags</em>» that are only present because SQL cannot store the tags in the post table. As such, extracting a post from an SQL database counter-intuitively requires one query to grab the post itself, and another query to grab its tags.</p>
<p>CouchDB does away with accidental relationships completely by storing JSON documents. While this might allow a performance gain in some cases, the main benefit is that object <em>composition</em> as described by the programmer in the application code is persisted intuitively, without jumping through the intellectual hoops typical in relational storage.</p>
<h3>4. An identifier-centric application architecture is possible</h3>
<p>What does it mean to be identifier-centric or object-centric? A function to get the full URL of an article, in an <em>object-centric</em> application, is a function that takes an article object as an argument (or possibly a member function of the article object) and returns the article&#8217;s full URL. In an identifier-centric application, it would be a function that takes an article identifier as an argument (or possibly a member function of the article identifier class) and returns the full URL.</p>
<p>Identifier-centric architectures have major design benefits over object-centric ones, with clear consequences in terms of productivity and correctness, but have a major performance problem as the <em>same</em> data is read from the database several times unless some very complex caching strategies are applied — that data might be read using a quite complex SQL query that is hard to keep in cache correctly.</p>
<p>From my experience, the vast majority of queries in a CouchDB application will either query a document by its identifier, or query a view for several key-identifier-document pairs. In short, most of the data manipulated by the application can be easily traced back to an identifier without any specific design effort. And get-document-by-id requests are far easier to cache and optimize than arbitrary SELECT requests, both at the application level (we have a temporary cache that lasts the lifetime of the HTTP request) and with key-value caches like Memcache.</p>
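<p>Here is a minimal sketch of an identifier-centric function backed by a request-lifetime cache. The <code>store</code> dictionary, the <code>RequestCache</code> class and the URL format are assumptions made for the example:</p>

```python
# Hypothetical document store and read counter.
store = {"article-7": {"slug": "coping-with-inconsistent-databases",
                       "year": 2011, "month": 8}}
reads = {"count": 0}

class RequestCache:
    """Lives for one HTTP request; every document is fetched at most once."""
    def __init__(self):
        self._docs = {}

    def get(self, doc_id):
        if doc_id not in self._docs:
            reads["count"] += 1
            self._docs[doc_id] = store[doc_id]
        return self._docs[doc_id]

def article_url(cache, article_id):
    # Identifier-centric: takes an identifier, not a loaded article object.
    doc = cache.get(article_id)
    return "/%d/%02d/%s/" % (doc["year"], doc["month"], doc["slug"])

def article_slug(cache, article_id):
    return cache.get(article_id)["slug"]

cache = RequestCache()
url = article_url(cache, "article-7")
slug = article_slug(cache, "article-7")
print(url, reads["count"])
```

<p>Both functions ask for the same document by identifier, yet the store is only hit once per request, which is what makes the identifier-centric style affordable.</p>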
<p>This may sound like a performance argument, but it isn&#8217;t, or at least not in the traditional «<em>NoSQL is faster than SQL</em>» sense. It just means that using NoSQL makes an identifier-centric architecture <em>acceptable</em> in terms of performance.</p>
<p><small>Article image © Satoru Kikuchi — <a href="http://www.flickr.com/photos/satoru_kikuchi/4461605065/">Flickr</a></small></p>
]]></content:encoded>
			<wfw:commentRss>http://www.nicollet.net/2011/07/nosql-is-a-premature-optimization/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>The Art of Development Time Estimates (Part 1)</title>
		<link>http://www.nicollet.net/2011/07/time-estimates-1/</link>
		<comments>http://www.nicollet.net/2011/07/time-estimates-1/#comments</comments>
		<pubDate>Thu, 14 Jul 2011 13:07:35 +0000</pubDate>
		<dc:creator>Victor Nicollet</dc:creator>
				<category><![CDATA[Dynamic]]></category>
		<category><![CDATA[Productivity]]></category>

		<guid isPermaLink="false">http://www.nicollet.net/?p=2436</guid>
		<description><![CDATA[Writing software takes time, and time is money both in terms of programmer wages and in terms of delayed releases. It makes sense to try and predict ahead of time how long a given feature would take, in order to make an informed decision about whether it should be attempted, reduced or eliminated. If your [...]]]></description>
			<content:encoded><![CDATA[<p><img class="aligncenter size-full wp-image-2437" title="tree" src="http://www.nicollet.net/wp-content/uploads/2011/07/tree.png" alt="" width="675" height="100" /></p>
<p>Writing software takes time, and time is money both in terms of programmer wages and in terms of delayed releases. It makes sense to try and predict ahead of time how long a given feature would take, in order to make an informed decision about whether it should be attempted, reduced or eliminated. If your job is to predict durations, make sure you understand whether you are expected to provide a back-of-the-envelope <em>approximation</em> — with the implication that it could be wrong by an order of magnitude in both directions — or if you&#8217;re going for a  <em>guarantee</em> — this feature will cost no more than X days of work, unless something really catastrophic occurs, which is what a paying customer wants to know.</p>
<p>If your co-workers ever start using your approximations to define milestones, prepare deadlines and discuss delays, you have not insisted enough on the fact that these were approximate answers. If anyone asks me for an on-the-fly estimate, I provide an upper and lower bound as &#8220;this will take somewhere between 2 and 10 days&#8221;. This is an outrageously wide range, but it&#8217;s a fair reflection of how wrong I can be with my on-the-fly estimates, and it deters anyone from just adding up the estimates to come up with a deadline. Yes, some people have tried converting &#8220;between 2 and 10 days&#8221; to &#8220;around 5 days&#8221; but I gave them the evil eye every single time. If anyone needs to turn an approximation into a guarantee, there is no sane reason to use anything but the upper bound.</p>
<h3>What Could Go Wrong ?</h3>
<p>There&#8217;s a fairly common mistake to be made with the upper bounds, and I&#8217;ve made it myself quite a lot: not being pessimistic enough. We&#8217;re lazy humans, so being optimistic is natural: we come up with a few tasks that need to be done, slap a reasonable duration on each one, and add them up. The upper bound is then pulled out of a top hat as being two to three times higher than the lower bound, because <em>that feels right</em>. Quite to the contrary, the upper bound must be calculated by actively looking for those things that can go wrong. Newbie programmers can usually provide a fairly accurate optimistic estimate because knowing <em>what needs to be done</em> is a prerequisite of being a programmer at all, but the pessimistic estimate requires knowledge of <em>what can go wrong</em>, which by definition is an esoteric list of accidents gathered from experience rather than rational forethought:</p>
<ul>
<li>The feature involves changing some code that is unusually brittle or unstable, so time will be needed to either pay the technical debt up front and bring that code back to acceptable quality levels, or soak up the cost of hunting for bugs after the code has been changed. This is the most frequent issue I encounter when dealing with changes to existing software, because not all code is of equal quality regardless of how much effort you put into it.</li>
<li>A library does exist, but preliminary analysis failed to observe that it only supports 95% of the required feature set, so additional time is necessary to obtain the missing 5%. Several internet-facing modules in one of my recent projects use a standard library for doing HTTP requests, but I discovered late during development that said library did not support HTTPS, which prompted me to include a second library, and incur technical debt related to having two overlapping libraries in the same project.</li>
<li>The library does fulfill all requirements, but happens to contain an obscure bug that prevents the feature from working as expected, so more time is spent trying to work around the bug and get the library authors to fix it. This is especially nasty when no replacement is possible, such as <a href="http://stackoverflow.com/questions/6549648/strange-error-message" target="_blank">errors in database servers</a>.</li>
<li>The code works as written, but QA testing reveals massive performance issues on typical user input, and time is required to correct the issue. On an older project, I used a <a href="http://www.fyneworks.com/jquery/star-rating/" target="_blank">jQuery plugin</a> for handling five-star ratings, with a single rating component costing 300 milliseconds in initialization — nothing noticeable on our test pages where only one component was used, but it brought the page load time to an unacceptable three seconds because users created feedback polls with dozens of such components.</li>
<li>The programmer who implemented the first half of the feature is ill, on vacation, fired, fighting fires on another project, demotivated, stuck in the snow or otherwise unavailable. Another developer is brought in and needs to spend some time getting familiar with the half-completed code (and getting that uncommitted code from the unavailable developer&#8217;s laptop was, in itself, a delay).</li>
<li>The programmer who implemented the feature delivered an incomplete buggy product several days late.</li>
<li>The programmer misunderstood the requirements and implemented the wrong feature.</li>
<li>An unforeseen edge case is detected that has severe consequences on the application architecture. For instance, a given server-side process was assumed to be synchronous but is discovered to be asynchronous with latencies of several minutes on high server load. This makes the original plans for a five-second loading page obsolete, and calls for a costlier, asynchronous &#8220;we&#8217;ll start working on this and notify you when we&#8217;re done&#8221; user interface strategy instead.</li>
</ul>
<p>The list goes on. Think of it as a shopping list you can go through when coming up with a pessimistic upper bound — start with the lower bound and add possible accidents.</p>
<p>Announcing a &#8220;between 2 and 10 days&#8221; range out of the blue can sound ridiculous, but it is much less so when it&#8217;s actually backed by a list of potential problems. Eight days spent working around library issues, obscure edge cases and performance problems is actually pretty normal in my own experience if these problems <em>do</em> come up.</p>
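<p>The &#8220;start with the lower bound and add possible accidents&#8221; approach can be sketched in a few lines of Python. The task names and durations below are made-up figures chosen to reproduce the 2-to-10-day range, not data from a real project.</p>

```python
# Hypothetical figures: durations are in days and purely illustrative.
def estimate_range(tasks, accidents):
    """Return (lower, upper) bounds in days.

    tasks     -- mapping of task name to its optimistic duration
    accidents -- mapping of possible problem to the extra days it would cost
    """
    lower = sum(tasks.values())
    # The upper bound assumes every listed accident actually happens.
    upper = lower + sum(accidents.values())
    return lower, upper


tasks = {"form": 1, "server-side handling": 1}
accidents = {
    "brittle code needs cleanup first": 3,
    "library lacks a required feature": 2,
    "performance issue found during QA": 3,
}

lower, upper = estimate_range(tasks, accidents)
print(f"between {lower} and {upper} days")
```

<p>Keeping the accidents as a named list is the point: the upper bound can then be defended item by item instead of being a multiplier pulled out of a hat.</p>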
<p>Stay tuned for the next issue, where I will discuss how to work with your team and your stakeholders to lower those estimates.</p>
<p><small>Article image © Alexandre Pereira — <a href="http://www.flickr.com/photos/apr77/5927639523/">Flickr</a></small></p>
]]></content:encoded>
			<wfw:commentRss>http://www.nicollet.net/2011/07/time-estimates-1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Dealing With Huge Projects</title>
		<link>http://www.nicollet.net/2011/05/dealing-with-huge-projects/</link>
		<comments>http://www.nicollet.net/2011/05/dealing-with-huge-projects/#comments</comments>
		<pubDate>Thu, 26 May 2011 10:30:05 +0000</pubDate>
		<dc:creator>Victor Nicollet</dc:creator>
				<category><![CDATA[Dynamic]]></category>
		<category><![CDATA[Architecture]]></category>
		<category><![CDATA[Objective Caml]]></category>
		<category><![CDATA[Productivity]]></category>

		<guid isPermaLink="false">http://www.nicollet.net/?p=2392</guid>
		<description><![CDATA[Right now, I&#8217;m the only developer working on RunOrg, which happens to be a 45k-line project written in OCaml. According to a common terseness observation, this is equivalent to managing a 135k-line project in Java. Alone. OCaml shares a problem with many dynamic languages : it&#8217;s very expressive, but there is no general consensus on [...]]]></description>
			<content:encoded><![CDATA[<p><img class="aligncenter size-full wp-image-2393" title="cake" src="http://www.nicollet.net/wp-content/uploads/2011/05/cake.png" alt="" width="675" height="100" /></p>
<p>Right now, I&#8217;m the only developer working on RunOrg, which happens to be a 45k-line project written in OCaml. According to a common terseness observation, this is equivalent to managing a 135k-line project in Java. Alone.</p>
<p>OCaml shares a problem with many dynamic languages : it&#8217;s very expressive, but there is no general consensus on what architectural best practices should be, so there are literally dozens of different ways a given feature might be implemented that cannot be discriminated on anything but taste. This leads to a variety of unique design choices throughout the application which, despite working well with each other, cause programmers to «discover» new architectures every time.</p>
<p>In the end, I believe that the philosophy of <strong>using the best tool for every job</strong> can easily be taken to painful extremes if you are not careful. You encounter a new problem, pick an unusual but well-adapted solution, <em>and it makes perfect sense to you</em>, so you move on. Months later, you come back and the solution does not make sense anymore because you have forgotten a small detail about how it works or why it was done this way, and you have to hunt that small detail down by reading the code. I&#8217;ve pretty much solved this anti-pattern, so I&#8217;ll come back to it later.</p>
<p>The main point I&#8217;m making here is that for every project, there is an ideal mudball of code that happens to perfectly implement everything without bugs, all in a single gigantic file, and you cannot write this mudball. For a human, there&#8217;s no way to manage anything mudballish past a few hundred lines because you cannot wrap your mortal mind around the possibility that every line might interact with any other line in the project&#8230; so, as an architect, you slice up the mudball into more acceptable bits that you politely call «modules» in order to reduce the number of things any given line might interact with. We reduce the amount of data we need to cope with by adding big «you don&#8217;t need to think about this» signs everywhere (and making sure the signs don&#8217;t lie, obviously).</p>
<p>And on a small scale, this works, because you only have a dozen modules and it&#8217;s enough to fit in your short-term memory. RunOrg currently has 260 high-level modules, and several times that amount in sub-modules. No UML design, no matter how comprehensive, can make all those modules fit in my mind at once. I must find some «you don&#8217;t need to think about this» signs before I can move on.</p>
<p>There are mostly two ways of slicing a given project into modules: horizontal and vertical.</p>
<p><strong>Vertical</strong> slices happen when there are dependencies, and modules look like layers stacked on top of each other with each layer being allowed to access the layers below. The RunOrg project architecture actually starts with a clean set of vertical slices : the <em>controller</em> layer deals with HTTP actions by using the <em>view</em> layer and the <em>model</em> layer below it, but the <em>view</em> layer cannot access the <em>controller</em> layer, and the <em>model</em> layer cannot access either.</p>
<p><strong>Horizontal</strong> slices happen when there are absolutely no dependencies, and modules look like books cleanly arranged next to each other on a shelf. This usually happens when those modules represent the same <em>concept</em> for different <em>purposes</em>. In the RunOrg project, the controller layer is divided into many action modules, with each of these modules handling the HTTP requests for a limited part of the application. For instance, there&#8217;s a <em>Login</em> module in charge of handling HTTP requests related to logging in, and a <em>File</em> module in charge of handling HTTP requests related to uploading files. The <em>concept</em> is the same (handle HTTP requests) but the <em>purpose</em> is different (logging in, uploading files). And there is no need for either module to know about the existence of the other.</p>
<p>Knowing whether slices are vertical or horizontal immediately tells the programmer about what dependencies should be considered for that slice. And it is all recursive : the <em>Login</em> module of the controller layer is further divided into a <em>Login_common</em> bottom layer for common definitions, the root <em>Login</em> top layer for binding everything together, and an intermediary layer of horizontal <em>Login_form</em>, <em>Login_signup</em>, <em>Login_lost</em> slices dedicated to the various independent aspects of logging in. The naming convention helps identify the pattern used.</p>
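<p>The dependency rule implied by the two kinds of slices can be written down as a small check. This is a simplified sketch in Python (not RunOrg code, which is OCaml): it only models the top-level controller/view/model layers, and the <em>Page</em> and <em>User</em> module names are hypothetical.</p>

```python
# Layers listed from top to bottom, as in the controller/view/model example.
LAYERS = ["controller", "view", "model"]


def may_depend(source, target):
    """Check whether module `source` may use module `target`.

    Modules are (layer, name) pairs. A layer may only reach the layers
    below it (vertical slices); two different modules within the same
    layer are horizontal slices and must ignore each other.
    """
    source_layer, source_name = source
    target_layer, target_name = target
    if source_layer == target_layer:
        return source_name == target_name
    return LAYERS.index(source_layer) < LAYERS.index(target_layer)


print(may_depend(("controller", "Login"), ("model", "User")))
print(may_depend(("controller", "Login"), ("controller", "File")))
```

<p>A check like this could run over the list of module references in a project, flagging any reference that goes upwards or sideways.</p>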
<p>In practice, the slices do not necessarily map to actual namespaces or modules because, especially at very low levels, the granularity involved to segregate the two would be too verbose. For instance, while it may appear that the <em>controller</em> layer is made up of modules that are all horizontal slices, this is not the case : while the <em>actions</em> (functions that respond to HTTP requests) are indeed independent horizontal slices, the layer also contains <em>helpers</em> (functions that provide common functionality to actions) that follow a vertical layering, and a given module will usually contain both actions and helpers indiscriminately.</p>
<p>What is relevant here is that the <strong>patterns</strong> used will let you determine easily what kind of slice you are dealing with. And a pattern is a named convention (<em>action</em>, <em>action helper</em>, <em>view template</em>, <em>table</em>) that is respected by relevant pieces of code, in terms of :</p>
<ul>
<li>Location : where is it within the module and file hierarchy, and in relation to other constructs within the same module ?</li>
<li>Structure : what does the code look like ? What parts of the pattern are expected to be changed and what parts should always be the same ?</li>
<li>Type : what is the signature of the module, class or function defined by the code ?</li>
<li>Name : is there a common suffix or a way to give a name to entities following the pattern ?</li>
</ul>
<p>These are guidelines : a pattern should usually have at least one of these, and the more the better, but you don&#8217;t have to implement all four if it is counter-productive to do so. Also, a pattern should define a dependency rule : it is generally understood that two pieces of code that follow the same pattern have a dependency dictated by that pattern, and that dependency is usually a horizontal slice.</p>
<p>The important thing about patterns is that they are not an external influence on your project. If you limit yourself only to those patterns that are dictated by the Gang of Four book, or by the framework you are using, then you will miss out on the many patterns that will emerge naturally within your application. Quite to the contrary, it is essential to identify as often as possible the patterns that appear in your code, clean them up by providing both a name and conventions of location/structure/type/name, and apply them wherever necessary. This will make your code more easily recognized by the programmer, because there are only a handful of fairly generic concepts to learn (the patterns) and everything else can be understood by finding out what patterns are used. Even better, familiarity with patterns places a  «you don&#8217;t need to think about this» sign on the parts of the pattern structure that stay the same, because they never change.</p>
<p>And now, I have cleverly returned to my previous point : the inevitable conflict between the <strong>use the best tool</strong> rule and the <strong>use the same pattern</strong> rule.</p>
<p>Using the best tool creates the risk of writing code that is very difficult to understand later on, because there are too many special cases. Using the same pattern everywhere causes problems when the pattern is ill suited to the problem being solved, such that it creates code that is too long, too repetitive, or too unsafe.</p>
<p>In my day-to-day routine, I follow the <strong>use the same pattern</strong> rule until it becomes too painful. Then, I just <strong>change the pattern</strong> to make it less painful to use, and propagate the changes to all the places where it is used, which in turn was made possible by the fact that I did use the same pattern everywhere.</p>
<p>Obviously, I don&#8217;t have a pattern for everything. So, whenever I encounter a problem for the first time, I go with the best tool instead. Once that kind of problem is solved several times, a pattern will emerge and some refactoring will happen.</p>
<p><small>Article image © holycalamity — <a href="http://www.flickr.com/photos/toyochin/1382531438/">Flickr</a></small></p>
]]></content:encoded>
			<wfw:commentRss>http://www.nicollet.net/2011/05/dealing-with-huge-projects/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Idle Musings on CouchDB Architectures</title>
		<link>http://www.nicollet.net/2011/02/couchdb-architecture/</link>
		<comments>http://www.nicollet.net/2011/02/couchdb-architecture/#comments</comments>
		<pubDate>Fri, 11 Feb 2011 09:43:00 +0000</pubDate>
		<dc:creator>Victor Nicollet</dc:creator>
				<category><![CDATA[Dynamic]]></category>
		<category><![CDATA[Architecture]]></category>
		<category><![CDATA[CouchDB]]></category>

		<guid isPermaLink="false">http://www.nicollet.net/?p=2228</guid>
		<description><![CDATA[One of my recent application design decisions was to go the full NoSQL route, using CouchDB as the database. For those of you who never heard of it, CouchDB has the following benefits that I rely on: Flexible format. It stores any JSON-encoded object instead of lines with pre-defined data columns. In fact, there is [...]]]></description>
			<content:encoded><![CDATA[<p><img class="size-full wp-image-2229 alignright" style="margin-left: 25px; margin-bottom: 25px;" title="logo" src="http://www.nicollet.net/wp-content/uploads/2011/02/logo.png" alt="" width="175" height="150" />One of my recent application design decisions was to go the full NoSQL route, using <a href="http://couchdb.apache.org/" target="_blank">CouchDB</a> as the database. For those of you who never heard of it, CouchDB has the following benefits that I rely on:</p>
<ul>
<li><strong>Flexible format</strong>. It stores any JSON-encoded object instead of lines with pre-defined data columns. In fact, there is no built-in schema, so the application code is free to store any kind of data format in there without having to run ALTER TABLE statements. This was the biggest selling point for me, because it drastically reduces the development time of any database-related code.</li>
<li><strong>Lock-free Master-Master replication</strong>. Out of the box : you can have two CouchDB instances running on two servers, and send write requests to both, and they will <em>eventually</em> be in sync. This is great for both spreading the write load and for handling server failure gracefully. The catch: your application is partially responsible for not screwing things up.</li>
<li><strong>Complex Queries</strong>. As far as queries go, CouchDB only caters to <em>high-performance</em> queries. Just like in a traditional RDBMS, query performance is improved by defining an index. The difference is that if no index is defined, you <em>cannot run</em> the query on CouchDB (whereas with SQL, the query would run by doing full table scans or in-memory sorting or similar low-performance fallback solutions).  This means that if your queries run, they&#8217;ll run <em>fast</em>, but it also requires you to put more thought into any requests you might need to make. On the other hand, CouchDB indexes (they&#8217;re called <em>views</em>) are much more powerful than standard SQL indexes. If you wanted to, say, sort a list of users by the number of non-ASCII characters in their name, you could certainly do so.</li>
</ul>
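<p>To make the last point concrete, here is a rough emulation of the idea behind a view: compute a key for every document ahead of time and keep the rows sorted by that key. Real CouchDB views are normally written as JavaScript map functions, so this Python version is only an illustration, and <code>build_view</code> is a name invented for it.</p>

```python
def non_ascii_count(name):
    """Key computed for each user: number of non-ASCII characters."""
    return sum(1 for ch in name if ord(ch) > 127)


def build_view(docs, key_of):
    """Emulate an index: (key, document id) rows kept sorted by key."""
    return sorted((key_of(doc), doc["_id"]) for doc in docs)


users = [
    {"_id": "u1", "name": "Zo\u00eb M\u00fcller"},  # two non-ASCII characters
    {"_id": "u2", "name": "Victor"},                # none
    {"_id": "u3", "name": "Ren\u00e9e"},            # one
]
rows = build_view(users, lambda doc: non_ascii_count(doc["name"]))
print(rows)
```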
<p>The general consensus over NoSQL is that it should mean «Not Only SQL» : use an SQL database for the added query flexibility and transaction management, and use a non-SQL solution where performance requirements create a need for it. The typical solution would be to have the SQL database act as a Master and regularly update the data in the otherwise read-only non-SQL database. And there are <a href="http://www.infoq.com/presentations/Enterprise-NoSQL" target="_blank">pretty good points to be made about that</a>, such as Enterprise products being expected to respond to a vast range of queries.</p>
<p>Still, RunOrg¹ is not (<em>yet</em>) intended as an Enterprise product, and all the data display goes through the web screens we design — so there&#8217;s no serious pain in writing the CouchDB views at the same time.</p>
<p>And while CouchDB does not support transactions, there is a clean way in which transaction-like effects can be achieved (with the added benefit of being lock-free): every document in a CouchDB database stores a <em>revision hash</em>, and updating or deleting an element requires you to provide the current revision hash as part of the query. So, the typical update process looks like this:</p>
<ol>
<li>Read current document.</li>
<li>Construct new document based on existing data.</li>
<li>Send the update query, using the revision hash from step 1.</li>
<li>If step 3 failed (because the document was changed while you were working on it), go back to step 1.</li>
</ol>
<p>If you&#8217;ve ever used Subversion <em>et al</em>,  you can recognize steps 1-2-3 as being Update, Merge, Commit. It&#8217;s that simple. The catch: it&#8217;s limited to only one document at a time, so <em>you cannot atomically update two documents</em>.</p>
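<p>The four steps can be sketched as a retry loop. The example below does not talk to CouchDB: <code>FakeStore</code> is an in-memory stand-in written for this sketch, with integer revisions instead of revision hashes, so that the loop can be run as-is.</p>

```python
class Conflict(Exception):
    """Raised when the revision supplied with an update is stale."""


class FakeStore:
    """In-memory stand-in for a document database with revision checks."""

    def __init__(self):
        self._docs = {}  # id -> (revision number, document)

    def get(self, doc_id):
        return self._docs[doc_id]

    def put(self, doc_id, doc, rev=None):
        current_rev = self._docs[doc_id][0] if doc_id in self._docs else None
        if rev != current_rev:
            raise Conflict(doc_id)
        new_rev = (current_rev or 0) + 1
        self._docs[doc_id] = (new_rev, dict(doc))
        return new_rev


def update(store, doc_id, change, max_attempts=5):
    """Apply `change` to a document, retrying on revision conflicts."""
    for _ in range(max_attempts):
        rev, doc = store.get(doc_id)                # step 1: read
        new_doc = change(dict(doc))                 # step 2: build new document
        try:
            return store.put(doc_id, new_doc, rev)  # step 3: send with revision
        except Conflict:
            continue                                # step 4: start over
    raise Conflict(doc_id)


store = FakeStore()
store.put("article", {"title": "Draft", "views": 0})
update(store, "article", lambda d: {**d, "views": d["views"] + 1})
```

<p>The <code>max_attempts</code> limit is an addition of this sketch: on a heavily contended document, an unbounded loop could spin for a long time.</p>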
<p>Of course, if you&#8217;ve ever used CouchDB, you probably know all of this, so let&#8217;s get to the point already. When you design your CouchDB databases, there are several important things you need to keep in mind. In no particular order, they are:</p>
<h4><img src="http://runorg.com/public/icon/time.png" alt="" /> You Need Asynchronous Processing</h4>
<p>Any CouchDB setup needs some tending-to on a regular basis. Databases have to be compacted (I do so hourly, but I expect this to evolve as the number of users increases), changes between databases will need to be propagated and conflicts will have to be detected and handled.</p>
<p>In addition to that, you will probably have application-related needs, such as sending e-mail (you don&#8217;t want to lock the HTTP request until the SMTP conversation is over, especially if you&#8217;re sending more than one message at a time), processing image files or office documents (you probably want to do that processing on a different server with the appropriate software anyway) or long-running requests.</p>
<p>In short, you need one or more asynchronous processing bots interacting with the database. To make things easier and reuse the data access code, I just write my software to be able to run in both async-bot mode and in HTTP-server mode. This led me to design a &#8220;save task to database for later execution&#8221; construct in my supporting library.</p>
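<p>A construct of that kind might look like the following Python sketch. It is not the supporting library mentioned above: the names <code>task</code>, <code>defer</code> and <code>run_pending</code> are invented for the example, and a plain list stands in for the database of task documents.</p>

```python
import json

# Hypothetical registry mapping task names to functions.
HANDLERS = {}


def task(name):
    """Register a function so a bot can later run it by name."""
    def register(fn):
        HANDLERS[name] = fn
        return fn
    return register


def defer(queue, name, **args):
    """HTTP-server mode: store the task as a JSON document instead of running it."""
    queue.append(json.dumps({"task": name, "args": args}))


def run_pending(queue):
    """Async-bot mode: execute and remove every stored task, oldest first."""
    results = []
    while queue:
        doc = json.loads(queue.pop(0))
        results.append(HANDLERS[doc["task"]](**doc["args"]))
    return results


@task("send_mail")
def send_mail(to, subject):
    # A real handler would talk to an SMTP server here.
    return f"mail to {to}: {subject}"


queue = []  # stands in for a database of task documents
defer(queue, "send_mail", to="alice@example.com", subject="Welcome")
```

<p>Because the task is stored as data (a name plus JSON arguments), the same code base can enqueue it from the HTTP server and execute it from the bot.</p>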
<h4><img src="http://runorg.com/public/icon/chart_organisation.png" alt="" /> Determine the Master Data and Propagate</h4>
<p>While the CouchDB view system is fairly flexible, it&#8217;s not universal. The textbook example is blog article tags. The tags on a given article would be stored as an array in the article document itself:</p>
<pre style="padding-left: 30px;">{
  "title" : "Idle Musings on CouchDB Architectures",
  "body"  : "...",
  "tags"  : [ "Architecture" , "CouchDB" ]
}
</pre>
<p>This simple format lets you <strong>1.</strong> see an article&#8217;s tags, <strong>2.</strong> find all articles with a given tag and <strong>3.</strong> count the number of articles for each tag. It does not, however, let you find the ten most used tags — you could certainly query the entire &#8220;number of articles for each tag&#8221; view and then sort the data in memory, but if an application contains tens of thousands of one-document tags, you&#8217;re basically querying 1000 lines for every line you display.</p>
<p>The suggested solution is to create another database (usually in the same database server) to store documents representing the tags in a more adapted format, such as:</p>
<pre style="padding-left: 30px;">{
  "_id" : "Architecture",
  "num" : 18
}
</pre>
<p>This format lets you easily sort the documents on the num field. Problem solved!</p>
<p>How you actually copy the data from one database to the other is up to you. Just keep in mind:</p>
<ul>
<li>Always clearly determine which documents or fields are original data and which are cache data. You don&#8217;t want to mistakenly update the master from the slave, or update the slave instead of the master. Try to keep the distinction at database level (this database contains slave data), and only resort to slave fields when it&#8217;s absolutely necessary.</li>
<li>You <strong>need</strong> to have a periodic refresh process that rebuilds the cache from scratch, just in case it ended up out of sync. Depending on your data, a day-long process, an hourly process or a midnight process might be more adapted.</li>
<li>How fresh must your data be? Perfect freshness means you need to update the cache as part of the normal document-saving process — great for having up-to-date data, but slows things down. Minute-level freshness lets you delegate the cache update to an async process that detects the change and refreshes the cache. With hour-level freshness, you can rely solely on the complete cache rebuild.</li>
</ul>
<p>Either way, have a consistent picture in mind of how you want to achieve this before you need it — trust me, copying data to another database is a lot easier than hand-crafting the perfect data structure to handle every query you need.</p>
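<p>As an illustration, the rebuild-from-scratch process for the tag example could look like this in Python. Plain lists stand in for the two databases, and the in-memory sort stands in for a view on the num field; both function names are made up for the sketch.</p>

```python
from collections import Counter


def rebuild_tag_documents(articles):
    """Rebuild the tag cache from scratch: one {"_id", "num"} document per tag."""
    counts = Counter(tag for article in articles for tag in article["tags"])
    return [{"_id": tag, "num": num} for tag, num in counts.items()]


def most_used(tag_documents, limit=10):
    """Return the `limit` most used tags, highest count first."""
    return sorted(tag_documents, key=lambda doc: (-doc["num"], doc["_id"]))[:limit]


articles = [
    {"title": "A", "tags": ["Architecture", "CouchDB"]},
    {"title": "B", "tags": ["Architecture"]},
    {"title": "C", "tags": ["PHP", "Architecture"]},
]
tag_documents = rebuild_tag_documents(articles)
print(most_used(tag_documents, limit=2))
```

<p>Since the cache is derived entirely from the master data, running the rebuild twice gives the same result, which is what makes it safe to schedule as a periodic repair job.</p>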
<h4><img src="http://runorg.com/public/icon/table_row_insert.png" alt="" /> Don&#8217;t Update, Insert-and-Merge instead</h4>
<p>Updates are slower than plain inserts because you need to read the original data  first, and they create the risk of conflicts with Master-Master replication. The textbook example is with Alice and Bob updating a given document in their respective databases: Alice sets the title to <em>Foo</em>, and Bob adds a few words to the article&#8217;s body. Then, replication happens, a conflict appears and you need to determine what the title and body should be — in this case, the sane thing to do would be to keep Alice&#8217;s title and Bob&#8217;s edits to the body, but you don&#8217;t have enough information when resolving the conflict to actually know that.</p>
<p>Now, consider a different strategy: Alice inserts a &#8220;set title to <em>Foo</em>&#8221; line in her database while Bob inserts a &#8220;change document body&#8221; line in his database. You can then retrieve the current version of a document by reading all the lines related to that document and merging them together according to whatever rules you see fit (and, for bonus performance points, save the result to another cache database as described above). When the replication happens, both lines will appear in both databases, the merge code will run again, and both changes will appear in the resulting document.</p>
<p>And you get a free revision history with the ability to selectively cancel changes down the line. Or, you can decide to compact older inserts (when it becomes obvious that there&#8217;s no risk of collision anymore) to save memory and improve merge performance.</p>
<p>Please note that this applies only to <em>master</em> data — <em>slave</em> data conflicts can be trivially solved by refreshing the cache from the master, so update slave data to your heart&#8217;s content.</p>
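<p>The Alice and Bob scenario can be sketched as follows. The shape of a change line and the merge rule (for each field, the most recent line wins) are assumptions made for this example; any deterministic rule would do, as long as the result does not depend on the order in which lines arrive.</p>

```python
def merge(lines):
    """Rebuild the current document from its change lines.

    Each line is a dict with a timestamp "at", the "field" it sets and the
    new "value". The rule used here (an assumption): for each field, the most
    recent line wins; ties are broken by the line's "_id".
    """
    doc = {}
    for line in sorted(lines, key=lambda l: (l["at"], l["_id"])):
        doc[line["field"]] = line["value"]
    return doc


alice = [{"_id": "a1", "at": 10, "field": "title", "value": "Foo"}]
bob = [{"_id": "b1", "at": 12, "field": "body", "value": "A few more words."}]

# After replication, both databases hold both lines, in any order.
current = merge(alice + bob)
print(current)
```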
<h4><img src="http://runorg.com/public/icon/key_delete.png" alt="" /> Avoid Unique Constraints</h4>
<p>This is possibly the single most annoying issue with CouchDB, but it&#8217;s pretty much part of the distributed Master-Master package. The basic idea is this: you need a given field or value provided by the user to be unique. For instance, you don&#8217;t want two accounts to have the same username. So, when users pick their user name, you need to atomically check if it&#8217;s available and reserve it. This is impossible in a Master-Master scheme without expensive locks or elaborate cross-transactional strategies.</p>
<p>The first thing you should try is to eliminate that constraint or turn it into something a little bit more amenable. For instance, if two users reserve the same e-mail address as their username, you could detect that once it happened and merge the accounts — they probably belong to the same user. When it&#8217;s applicable, the detect-collision-and-merge solution is the one that&#8217;s easier for performance.</p>
<p>The last resort solution, which isn&#8217;t actually that bad, is to give up on Master-Master replication <em>for that specific feature</em>. You can have a dedicated database to store username-id relationships:</p>
<pre style="padding-left: 30px;">{
  "_id"    : "victor.nicollet",
  "account": "3958377a5093b22673a26b6c33002e02"
}</pre>
<p>The actual account would be stored in a different database which does use Master-Master replication:</p>
<pre style="padding-left: 30px;">{
  "_id"     : "3958377a5093b22673a26b6c33002e02",
  "username": "victor.nicollet",
  "fname"   : "Victor",
  "lname"   : "Nicollet",
  "passhash": "..."
}
</pre>
<p>All creation requests would first try to create the username document in the first database — if it&#8217;s already taken, an insert conflict happens immediately and you can react by asking the user for another username. If no conflict happens, you can then insert the document in the second database with no fear of conflicts. The same process happens when trying to change the username of an existing account.</p>
<p>This solution works as long as the number of creation requests remains small, and you can afford the round-trip to the central database — which might be on a different continent (or worse, you might be working offline).</p>
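<p>The two-step creation process can be sketched like this in Python. Two dictionaries stand in for the two databases, and the explicit membership test stands in for the insert conflict; with a real database the check and the insert would be a single atomic operation, which this single-process sketch does not need.</p>

```python
import uuid


class Taken(Exception):
    """Raised when the username document already exists."""


def create_account(usernames, accounts, username, **fields):
    """Reserve `username` in the central store, then create the account.

    `usernames` stands in for the single, non-replicated database and
    `accounts` for the replicated one; both are plain dicts here.
    """
    if username in usernames:
        # In a real database this would surface as an insert conflict.
        raise Taken(username)
    account_id = uuid.uuid4().hex
    usernames[username] = {"_id": username, "account": account_id}
    accounts[account_id] = {"_id": account_id, "username": username, **fields}
    return account_id


usernames, accounts = {}, {}
account_id = create_account(usernames, accounts, "victor.nicollet", fname="Victor")
```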
<h4><img src="http://runorg.com/public/icon/cut.png" alt="" /> On The Splitting Of Databases</h4>
<p>With CouchDB, there&#8217;s the inherent restriction that a given database must be entirely contained within a single server. Replication lets you split the <em>request</em> load, but the <em>storage</em> load remains the same as both duplicates must store the exact same data eventually. In addition to that, the underlying storage strategy is insert-only, meaning that if you update a given document a thousand times, you eat up a thousand times the disk footprint of that document until the next compaction. And in order to run a compaction, the server needs to have enough disk space to store both the uncompacted database and the newly compacted one, so <strong>always make sure you&#8217;re not running close to full disk usage</strong>.</p>
<p>Keeping the database small is done in a number of ways. For instance, you can keep ID values small (I use 11-character base-62 UUIDs) or use short field names in the JSON documents. One of the most potent techniques is simply splitting the contents across several databases.</p>
<p>One splitting strategy is to act based on the ID or another reasonably distributed property of the document. If even ID values are on server 0 and odd ID values are on server 1, you do not hinder your query-by-ID possibilities at all (you can determine if the ID is even or odd quite easily in the application, and pick the appropriate server). On the other hand, views don&#8217;t work anymore, since neither server can aggregate the entire range of existing values. You can get views to work by striving to keep together subsets of values that are queried together — RunOrg¹ stores items together on a per-association basis, so all items of a given association are available on the same server and can be map-reduced together — but this makes query-by-ID harder to achieve. Tradeoffs, as usual.</p>
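<p>Both variants come down to a function that maps a document to a server. In the Python sketch below, a CRC32 checksum replaces the even/odd test so that it works on non-numeric IDs; the server URLs and the <code>association</code> field name are placeholders.</p>

```python
import zlib

SERVERS = ["http://db0.example.com", "http://db1.example.com"]  # hypothetical


def server_for(key):
    """Pick a server from a stable checksum of `key`."""
    return SERVERS[zlib.crc32(key.encode("utf-8")) % len(SERVERS)]


def server_for_item(item):
    """Keep all items of one association together so views still work."""
    return server_for(item["association"])


print(server_for("3958377a5093b22673a26b6c33002e02"))
```

<p>With <code>server_for_item</code>, loading an item by ID requires knowing its association first, which is the query-by-ID tradeoff described above.</p>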
<p>Another splitting strategy is to act based on the type of the document. Put user accounts in database A, posted articles in database B, article incremental updates in database C and user comments in database D. As long as there&#8217;s no database-level interaction between the different documents, this will work, and the code loading a document knows what kind of document it&#8217;s loading and where it should load it from. The catch: a view from database B cannot return documents from database A (so, no querying for &#8220;the article and its comments&#8221; or &#8220;the updates and their authors&#8221;). Be aware of that limitation ahead of time.</p>
<h4><img src="http://runorg.com/public/icon/map.png" alt="" /> Keep A Schema</h4>
<p>Not requiring a schema is no excuse for not having one available. It always helps to remember what is being stored where and how (and why!), which documents are updated from which others, and so on. While standard relational representations can work, don&#8217;t be afraid to include document-store-related representation features, such as the way for an item to reference a list of other items, and views.</p>
<h4><img src="http://runorg.com/public/icon/heart.png" alt="" /> Final Words</h4>
<p>As it stands, CouchDB is certainly expressive enough for our needs, once a few elementary features (propagation and asynchronous tasks) are available. The high flexibility of the data model, combined with the fact that an application can easily be made to read several document formats and convert to the latest one when possible, greatly improves the speed at which new modules are developed — right now, model code accounts for 5% of work time, with refactoring at 15% and user interface (HTML-CSS-JS) towers at a heartbreaking 80%.</p>
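<p>Reading several document formats and converting to the latest one can be as simple as the following Python sketch. The version history is invented for the example: version 1 is assumed to store a single <code>name</code> field, which version 2 splits into <code>fname</code> and <code>lname</code>.</p>

```python
def upgrade(doc):
    """Convert a stored document to the latest format, one version at a time.

    Hypothetical history: version 1 stored a single "name" field, version 2
    splits it into "fname" and "lname".
    """
    doc = dict(doc)
    version = doc.get("version", 1)
    if version == 1:
        fname, _, lname = doc.pop("name").partition(" ")
        doc.update(fname=fname, lname=lname, version=2)
    return doc


print(upgrade({"name": "Victor Nicollet"}))
```

<p>Running every loaded document through such a function means old documents never need a bulk migration: they are converted whenever they are read and can be saved back in the new format.</p>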
<p>What are your takes on CouchDB? Do you use it? Did you try it? Any experiences you might have that would be worth sharing?</p>
<p><small>¹ <a href="http://www.runorg.com/" target="_self">RunOrg</a> is my Start-Up ; we provide an online tool that helps associations, unions, organizations and communities manage their members, contacts, activities, events, knowledge and online presence.</small></p>
]]></content:encoded>
			<wfw:commentRss>http://www.nicollet.net/2011/02/couchdb-architecture/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

