Monthly Archive for June, 2011

A Decentralized Dating Protocol

I’ve been wondering about how a decentralized dating protocol could be implemented.

Here’s the idea : Alice would love to date Bob, and Bob would also love to date Alice, but both of them are too shy to actually ask the other out. The good news is, Alice and Bob are the protagonists of almost every single cryptography protocol out there, and they naturally have asymmetric encryption keys. So, they resort to using the MAP (Mutual Attraction Protocol) to reveal whether they are interested in each other, but only if the feeling is indeed mutual.

An elementary implementation of the protocol could behave as such:

  1. Bob wishes to tell Alice about his feelings. He concatenates a fixed ILOVEYOU prefix and random suffix RAND, encrypts it with Alice’s public key APub, then with his private key BPriv, This creates the cipher BPriv(APub(ILOVEYOU || RAND)).
  2. He then publishes the cipher to a public location in a way that can be easily found by anyone looking for it — such as a database indexed by the public key of the uploader.
  3. Alice wishes to know whether Bob has feelings about her. She scours the public databases for any items matching Bob’s public key, and stumbles upon the aforementioned BPriv(APub(ILOVEYOU || RAND)) cipher.
  4. Shen then decrypts it with Bob’s public key, and uses her own private key to decrypt the result. She ends up with ILOVEYOU || RAND and determines, based on the presence of the initial ILOVEYOU that there’s indeed a love interest around.

The protocol is of course symmetrical: when you wish to express your interest, you should both publish it to the public database and look through that database for a reciprocal interest.

This version of MAP has the following interesting properties:

  • Only Bob can publish Bob’s feelings on the public database, because doing so requires encrypting them with Bob’s private key.
  • Only Alice can determine whether a given entry on the public database is about her, because doing so involves her own private key.

However, it assumes that Alice will only search for Bob’s message if she is herself interested in him. This is not necessarily true : Alice might be running some software that automatically searches through the public database for people who like her. This clearly violates the constraint that the feelings should only be revealed if they are indeed mutual — that, before Alice can determine whether Bob posted an “I’d date Alice” message, she must post an “I’d date Bob” message herself for Bob to read.

The question I am stuck on is, can these constraints be implemented without involving a trusted third party ?

Article image © Nattu — Flickr

Get Rid of Constraints

We add constraints (UNIQUE, FOREIGN KEY…) to our databases in order to achieve a tradeoff between how easy it is to write data to the database, and how easy it is to read data back. Constraints make writes harder because the application must ensure that those constraints are verified, and react appropriately when they are not. They make reads easier because they allow the application to make assumptions about the data it receives.

Sometimes, the benefits to data reading are not sufficient to offset the increased difficulty of writing it. Let me provide an example.

In a project management tool, users on the free plan are only allowed to have one project. Premium users can have any number of projects. An obvious solution would be to constrain the database so that a given  user can only have one free project at once. This would guarantee that any attempt to connect a second free project to an user would fail.

However, the possibility of non-technical failure of “connect a project to an user” will have to be handled in a variety of unusual locations:

  • Deleting project A, creating project B, and cancelling the deletion of project A would cause an error.
  • Merging two free user accounts would be interrupted because both already have a free project.
  • Downgrading from premium to free would break if the premium account had more than one project.

The good news is that your database will stop you from doing anything that would break the constraint, so you do not have to worry about a glitch letting free users have two projects. The bad news is that you now need to handle that failure specifically in three different locations in your interface: cancelling a deletion, merging two accounts, and an account downgrade that might happen when the user isn’t even connected. It’s possible to handle all three, but it’s quite repetitive, and the user interface for handling these is probably going to be different in each case.

A Posteriori Constraints

A better solution would be to allow users to have any number of projects in the database, but detect free users that have more than one project (because of cancelling a deletion, merging two accounts, or not renewing their premium subscription) and force them to take corrective action through the user interface. For instance, users with inconsistent accounts are greeted with a list of their projects and the possibility to keep one of them unlocked. The other projects are then locked and can only be unlocked by deleting the currently unlocked project or paying for a premium subscription again. You can even provide some leeway by letting them keep full access to all projects for a week before forcing locks on them, or let customer support temporarily unlock projects in cases of dire need.

The beautiful aspect of this is that the code required to solve the inconsistency is not repeated. Regardless of how you end up with more than one free project, the feature that corrects this is always the same. This means the application is smaller, which leads to fewer bugs and a more consistent user interface.

Of course, if non-subscribing users already have a project and try to create another, they are told to subscribe. Lack of constraints in the database does not mean there are no constraints or clever up-sells in the interface. While RunOrg does prevent users from exceeding the storage ceiling by uploading new documents, all other ways of breaking past the ceiling (such as un-deleting documents or lowering the ceiling) remain unaffected, and our administrator team only takes corrective action when there is a clear abuse of the software — you can stay a few megabytes above the ceiling for as long as you wish. As for the limit on how many users can join a community, there are even fewer restrictions — it would be quite harsh to prevent the 2011-2012 members from joining only because the 2010-2011 members have not left yet — so we ask our customers to return below those ceilings as fast as possible, and only do it ourselves if there is abuse.

There is a conceptual shift here from No Inconsistent Data In The Database to Detect And Correct Inconsistent Data. This step can be a bit difficult to stomach, especially if you are used to letting your database keep your data clean for you, but for many constraints the shift is actually worth it.

In the early design phase, we moved RunOrg from a standard MySQL architecture over to CouchDB, a NoSQL solution that provides absolutely no constraints beyond having a unique primary key. There are currently no foreign key constraints in the system, nor are there any unique constraints beyond enforcing many-to-many relationships (and these are fundamentally primary keys). This seemed like a huge challenge at first, but the system has grown to support complex data structures and business rules that go beyond standard document-based CRUD, simply by accepting that pieces of data can be missing or be duplicated, and applying an a posteriori solution, possibly involving user intervention, has worked for us so far.

Of course, what is true for document databases is not necessarily true in the SQL world. While we impose no constraints on the relationships between documents, we are pretty strict about the structure of documents themselves (especially since they are being read from OCaml, a strict static language), and there is no simple way of representing “Foo holds a list of Bars” in SQL without a foreign key (in CouchDB, it’s standard JSON…)

This is the important information here: those are structural constraints, they define a data structure in the database that must be respected by your implementation, as opposed to non-structural constraints that merely define relationships between higher-level entities and must be respected by your users. Identify the latter, and consider whether you can shift them to a posteriori constraints instead.

Article Image © Calsidyrose — Flickr

Does this repository make me look fat?

The main RunOrg SVN repository contains:

  • 38490 lines of OCaml implementation code spread over 273 *.ml files.
  • 5134 lines of OCaml signature code spread over 156*.mli files.
  • 4776 lines of HTML template code spread over 271 *.htm files.
  • 1971 lines of CSS in 6 files.
  • 1898 lines of JSON configuration in 26 files.
  • 1172 lines of JavaScript in 5 files.

That’s a total 53441 lines in 837 files.

Here’s a plot of how the 38490 lines of OCaml code came into being over the last eight months.

That is all.

Software Patents – Why, Why Not

Intellectual property exists for a reason: protecting creators from the theft of their creations. Not in the sense that the idea is stolen — idea theft would imply that the original owner would not have access to the idea anymore — but rather that the opportunity to monetize that idea is stolen or destroyed by someone with the industrial and commercial firepower to completely take over the market. Georges Méliès created the 1902 motion picture Le Voyage Dans La Lune, which was a special effects masterpiece at the time, but was denied the opportunity to monetize it in the United States because of piracy and eventually went bankrupt. The question asked by intellectual property supporters is, would people invest time and money in creating new motion pictures if they knew the same thing could happen to them? Copyright is intended to ensure that, if someone runs with the author’s work, the author can shut them down or make them pay.

The same reasoning exists behind patents: inventors spend time and money on the arduous process of elaborating and perfecting new machines or processes. Then, he starts selling the invention, and inevitably, other people and companies will notice. People and companies that did not participate in the elaboration, that have a better manufacturing infrastructure, a better sales network, and are not weighed down by the cost of researching the invention in the first place, so they can easily manufacture and sell the invention, outmatch the inventor on the market due to lower prices and larger quantities, and keep the profit to themselves. But if the inventor holds a patent, he can force the rogue manufacturers to pay him for his invention.

Perhaps the single greatest example of modern patents is the pharmaceutical industry. The cost to creating a new drug is tremendous, not because thinking of new molecules is a strenuous activity, but because of the many clinical trials required to turn tens of candidate molecules into a single FDA-approved doctor-prescribed drug. But once the drug exists, competitors can find out its composition at a fraction of the original cost (after all, the composition is published as part of the approval process), manufacture it, and sell it at a price that does not need to pay for the original research and development. Were they allowed to do so, one would expect the pharmaceutical innovation to freeze in a prisoner’s dilemma as stealing the drugs of others is far more profitable than inventing your own.

Patents work in these traditional industries for two reasons:

  • Inventing the patented mechanism or creation is a costly, arduous process, but creating a competing product based on the invention is comparatively cheap. This means that without patents, copying is more profitable than inventing.
  • Two people inventing the exact same thing without hearing from one another is extremely unlikely, either because the inventions are so esoteric and unusual that no two people would reasonably think of the same thing, or because the inventors are part of a community of experts that regularly hear of what the others are working on.

Software patents protect algorithms — processes executed by computers to transform data and interact with humans and other computers. The problem with patenting software is that the two reasons above are the exception rather than the rule.

Creating competing software products is hard. Assuming that Microsoft invents a wonderful new algorithm and includes it in Windows 8 without patenting it, simply reusing that algorithm does not make the creation of a serious Windows 8 competitor any easier. The modern software industry is so complex and multi-faceted that the reuse of any algorithm, no matter how clever or innovative, would not contribute significantly to the creation of a competing product. Copying over a single algorithm does not grant your competitors access to your network of compatible software, your corporate partnerships, your user base, your software development infrastructure, your reputation, your existing maintenance contracts, or any of the many other ways in which the average software company turns an algorithm into money.

And there’s more. The job of software programmers is to create new algorithms, and there are quite a lot of software programmers around. Hundreds of thousands of software programmers. This means that any algorithm which happens to be a slight modification or adaptation of an existing known algorithm, or the result of a deterministic problem-solving strategy, has already been invented at least a dozen times, currently sleep in a legacy project somewhere, and will be re-invented another dozen times in the coming years.

The possibility that someone might steal their work does not deter these programmers from inventing algorithms.

There is no prisoner’s dilemma in software development where everyone waits for the others to invent new things so they can copy them. We have a thriving start-up community bent on inventing new things that almost never patents anything at all. We have an industry that focuses on the quality and nature of the data it collects rather than the ever-changing processes that are applied to that data. The software industry in general does not need patents to keep the innovation ball rolling.

This, in itself, would be no issue — if software companies want to give money to the USPTO for nothing, let them be.

The problem is that software patents are actually hurting innovation. There is a strong tendency for the USPTO to grant patents for algorithms that are not in any way esoteric or unusual, or even new. As a software programmer, I invent new algorithms as part of my daily showering routine ; there is a significant probability that those algorithms are patented either because they are so obvious that a patent-happy company already thought of them, or because they are a special case of one of the surprisingly broad “any computing device, any storage mechanism, any input mechanism” patents being granted recently. The fact that I independently developed that algorithm through logical reasoning, simple modifications of an existing public domain approach, or even genius, does not absolve me from respecting the patent. Sure, I could weasel my way out of the patent by tweaking the algorithm, but doing so is hard for two reasons that are completely unrelated to the underlying technical principles :

  • The only way for me to know what patent I must work around is to look through all existing patents for those that might be applicable to my situation. This means I need to research existing patents as part of my daily shower routine, which is highly inefficient and wastes a lot of water.
  • Even if I was able to identify the applicable patents, I would certainly be able to create algorithms that still did the job without infringement, but I need to hire a lawyer to determine whether there is indeed no infringement.

All of this leads to a massive increase in the cost of developing new software, on the scale of having to check for copyright infringement every time you write an e-mail message. Where software innovation used to be a single engineer working for a few days on a project (which, let’s admit it, is fairly cheap), it now involves much costlier patent lawyers even in the case when no infringement actually happens. Needless to say, most start-ups don’t bother with the lawyers and live at constant risk of discovering that one of the algorithms they created on their own has been patented by someone else.

And all of this happens even when the patent is obviously invalid due to prior art, because it’s usually cheaper to spend a few days working around the patent than to risk going to court over it.

I will not even go into how the monopolies induced by the patent system hurt the users themselves. There have been plenty of discussions on that already.

But there is a flip side to all of this. Patents are not only about inventing new algorithms and processes, they also pay for the cost of validating them. Some algorithms are not solutions to black-and-white, solved-or-not-solved problems where one can prove on the white board that the invention always produces the correct output. Some problems are so complex that any candidate algorithm must be tested against huge amounts of data, and the testing is not cheap. It might be run against data that the inventor has spent a lot of time collecting from the real world, it might be run against data that the inventor had to buy at a steep price, or it might be tested on real-life humans or customers with the risks and costs this involves. And even if the tests themselves are free, there’s the possibility that one needs to try out hundreds of different algorithms or parameters before an acceptable solution is found.

This is almost the same problem as the pharmaceutical industry, where thinking up new molecules is cheap and running the clinical tests is the real cost. The only difference is that algorithm validation is not a mandatory or even obvious process. The various pieces of user interface in Apple, Facebook or Google products might seem utterly obvious when you see them, but each of them passed grueling internal testing processes and survived thrilling tournaments of feature comparisons before ever going live, and these processes and tournaments cost a lot more than the developers thinking about user interfaces during their morning shower.

I suppose the real question that needs to be asked is, why do we have a patent system in the first place? Is it because innovation cannot exist without patents (which is clearly not the case as far as the software industry is concerned) or is it because we somehow feel that pioneers deserve some kind of reward for being the first, even though they were often not the only ones?

Article image © Stuart Heath — Flickr

Pipelines

You’re a company. Your customers give you money in return for something. They usually start out without any knowledge of your existence and end up giving you money, and the paths they follow are one of the many customer acquisition pipelines you have created (perhaps unintentionally) for your company.

They are called pipelines because they usually convey a continuous flow of customers from point A (never heard of you) to point B (gave you money). Each is comprised of segments:

  1. They just heard about you: maybe you sent them an e-mail, called them on their phone or contacted them through a friend. Or maybe they did a Google search for your product and you were on the results page. Or they saw an ad for your product somewhere. Or your product was recommended by someone they trust. Whatever the reason, they are now aware of your existence. Some of these people will move on to the next step, which is to…
  2. Learn more about you: most of the time, they will visit your web site and sometimes look for references online or in their social circles. Sometimes, they call you, or have you call them, or even ask for a meeting and quick presentation. Either way, some of these contacts will decide that your company or product is interesting enough to…
  3. Try out your product: this could be a free limited-time trial, a demo version, an on-site pilot, a trailer video, a sample tray in a shop, or any other form of discovery that lets them actually experience your product without any serious commitment on their part. You can get away without a demo version if your product is either cheap enough that buying it once does not count as a significant commitment (such as buying a candy) or popular enough that it can convince customers on reputation alone (such as Half-Life 3). Anyway, based on the trial or the reputation, they will decide whether they…
  4. Become customers: this is the interesting part, where you get money in return for whatever you provide.

The number of people at each stage in the pipeline decreases: not all people who hear about you will even visit your website, not all people who visit your website will download the demo, and not all people who download the demo will buy the full version. Since having more people at the bottom of the pipeline is a good idea, a lot of efforts will go into making the pipeline more efficient — losing less people along the way. The customer conversion rate represents the percentage of customers that go from point A to point B, and you usually want the CR to be as high as possible.

In practice, there are two other important things to keep in mind : the customer acquisition cost is how much money you spend on a given customer (on average) to get them through the pipeline. If your pipeline costs you $200,000 in advertisement, phone bills, salesman wages and bonuses, to turn 1,000,000 people into 1,000 customers, then your CAC for that pipeline is $200/customer. The customer lifetime value is how much you can expect to earn from that customer over time, after subtracting the cost of the products you sold. If a customer pays $5 for a book that cost you $2 to manufacture, and never comes back, their CLV is $3.

What matters, then, is not the customer conversion rate, but your net profit per customer : CLV – CAC. A pipeline with a CAC of $200 that leads to a book-buying CLV of $3 is a pipeline that wastes $197 per customer in marketing-and sales-related costs: you spent $200,000 to have 1,000 customers buy $5000 worth of books (of which you only get to keep $3000) for a net loss of $197,000 ! In short, your job is making sure that the CLV – CAC difference is as high as possible in all your pipelines.

The reason why increasing the CR is a good idea is that a higher rate means more people come out of the pipeline, so the total pipeline cost is divided among all of them, and thus the CAC decreases. If I were to double the conversion rate on the above $200,000 pipeline, I would get 2,000 customers and the CAC would decrease to $100. Usually, a higher CR means lower CAC, but this breaks down when increasing the CR costs money: if you paid an additional $300,000 to get the better advertisements and salesmen that allow you to jump to 2,000 customers, then the CAC would actually increase to $250 ! Similarly, lowering the price usually means more people will become customers, which increases the CR and decreases the CAC — but it will also have an impact on the CLV, since you are now earning less money on every customer. If the price drops by $10 and the cost drops by $7, you just lost $3. Per customer.

Optimizing Pipelines Is Hard

The entire issue of sales and marketing can be summed up in a single short sentence:

You have to tweak the customer pipeline, which costs money, and you have no idea whether the conversion rate will improve enough to pay for the changes.

Usually, people have reasons to leave the pipeline other than the price. By finding those reasons and addressing the underlying issues, you increase the conversion rate. So, your course of action is:

  1. Find out why people are leaving the pipelines.
  2. Find out how you could eliminate or mitigate that issue.
  3. Estimate how much it would cost, how the conversion rate would evolve, and the resulting acquisition cost.
  4. Pick the solution that improves the acquisition cost the most, and implement it.
  5. Go to step 1.

Steps 1 and 3 are the difficult ones.

Step 1 is difficult because you need to determine why people who are not your customers don’t buy your product. Do you have an on-site satisfaction poll ? Do you send them a quick e-mail ? Do you have their contact information at all ? Do you even know what your conversion rate is ?

Step 3 is difficult because you actually need to estimate how many people would buy based on that one change. This is even harder, because people can always find new reasons not to buy, so that number will probably be smaller than your estimates from step 1.

These steps usually turn into the easier sequence known as A/B testing:

  1. Guess, or pay an expert to guess, why the people might be leaving the pipelines.
  2. Implement a solution that addresses the issue.
  3. Compare the new (B) and the old (A) conversion rates.
  4. If the solution actually improved profits, keep it !
  5. Go to step 1.

This solution does work, but it has two pretty heavy limitations. The first is that the solution from step 2 needs to be fairly cheap — it would be insane to implement a new major feature only to discover that no one cares about paperclip-shaped assistants. So, if the actual reason people are not buying is that a critical feature is missing, you will not find out through A/B testing.

The second limitation is that A/B testing has high fixed costs, because you need to actually change page layouts and shopping cart workflows for every test. These costs are one-shot, so they will be spread over all the customers coming out of the new pipeline — spending $1000 to get from 10,000 customers to 15,000 customers is a bargain, spending that same $1000 to get from 10 to 15 is not. So, unless your existing pipeline already has a fairly good throughput, A/B testing solutions to significant issues might just be too costly.

Article image © Brian Cantoni — Flickr



1342 feed subscribers
(readers who polled a feed this week)