Archive for the 'Strategy' Category

Do It Yourself

Unless you’re working in an esoteric field on the bleeding edge of technology, the vast majority of programming problems you face have already been solved many times by many other people, and several of these solutions are readily available on the web or in legacy code libraries you might have access to.

To solve a problem, you can

  • reinvent a particular wheel : the non-factored approach, since you create your own instance of that wheel, or
  • reuse one of its existing implementations : the factored approach, where several projects benefit from the same piece, including your own.

Both alternatives have costs and benefits that the experienced software engineer is aware of, and these will depend on your exact problem somewhere along the lines of :

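[Figure 1: time spent solving a problem as a function of problem size, with one red curve and one blue curve]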

The time spent solving a problem steadily increases with the size or difficulty of that problem, and is further subject to two important rules.

Non-factored is cheaper for small problems

A factored solution carries some overhead because it is used by several projects with different scopes. The “one click, 200 words” bias happens when non-technical managers hear “leverage an existing solution”, and see a picture of a one-click installer and a 200-word tutorial telling them their particular problem can be solved with two lines of C# code.

HolyGrail grail = new HolyGrail();
grail.doWhatIMean(/* No options here! ^_^ */);

Yeah. Riiiight.

Every one of us has spent days reading up on third party libraries just to decide if they are worth the effort, slaying compatibility dragons to make them talk with the rest of the project, filling in hundreds of configuration options that have no relevance whatsoever to the tiny problem at hand, teaching co-workers about the nooks and crannies of that code, and painstakingly wading through less-than-civilized error reporting to solve the obtuse problems that come up on the day before you release.

Even writing your own reusable code is orders of magnitude harder than just jotting down a quick one-shot solution to whatever problem you have. An excessive tendency to build generic code from the very beginning makes your development process look like Dragon Ball Z : you have to power up for fifteen episodes before you can show a splash screen.

This rule is the reason why the red curve stays above the blue curve for small problems.

Factored scales better for large problems

Solving a larger problem involves a larger solution. In a do-it-yourself situation, you have to make the solution larger yourself. When using a factored approach, you already injected an existing large solution into your project, and it only feels small because you’re using a small part of it. With the programming equivalent of flipping a few switches, you get to use a larger part.

The solution that involves the most code (the non-factored one, in case you wondered) also involves the most maintenance, documentation and development work. Whether this comes from a thousand-line reinvented wheel or obscene copy-pasting, having a large code base is something you will have to pay for in the long run. You don’t buy code, you rent it.

This rule is the reason why the blue curve ends up above the red curve for sufficiently large problems.

Keeping these two rules in mind, the key to making the right decision is determining where the red and blue curves intersect, and where your project stands. Easier said than done. For instance, what does “problem size” mean, precisely?

Problem size can be, literally, the size of the problem for an obvious metric. A hosted storage and distribution service like Amazon S3 is a bad choice for 1000 downloads per week, but an obvious solution for 1000 downloads per second.

It could be the number of things in the application that are similar to the one you’re implementing. Sending usage statistics back to your server is a small problem solved with a vanilla HTTP request. If you communicate with the server a lot, you might want to keep the URL and error handling logic together in one place.

Or it could be the number of features. Displaying data in table format takes two nested loops and some HTML. Sorting, filtering, asynchronous sending or editing involve some rather smart JavaScript development, or integrating a tool like jqGrid or ExtJS.
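For the second sense of problem size above (many similar calls to the same server), here is a minimal Python sketch of what keeping the URL and error handling together in one place could look like. The `ServerClient` name, the injectable `opener` parameter and the decision to swallow network errors are all assumptions made for illustration.

```python
import json
import urllib.error
import urllib.request


class ServerClient:
    """Single place that knows the server URL and how failures are handled."""

    def __init__(self, base_url, opener=urllib.request.urlopen, timeout=5):
        self.base_url = base_url.rstrip("/")
        self.opener = opener  # injectable so the client can be exercised offline
        self.timeout = timeout

    def post(self, path, payload):
        """POST a JSON payload; return True on a 2xx answer, False on any failure."""
        request = urllib.request.Request(
            f"{self.base_url}/{path.lstrip('/')}",
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        try:
            with self.opener(request, timeout=self.timeout) as response:
                return 200 <= response.status < 300
        except OSError:
            # URLError, HTTPError and socket timeouts all derive from OSError.
            # Assumption: usage statistics are not worth crashing the application for.
            return False
```

With a single call site this wrapper is pure overhead; it only starts paying for itself once several features talk to the same server.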

Once, Twice, Refactor

The special case of writing your own reusable code has been “solved” by Agile folks who suggest writing a non-reusable version of the code on the first try, and refactoring it to a reusable version the second time it’s needed. This is your third choice : go with the non-factored solution if you are unsure whether the problem is large enough to warrant the factored solution, and change your mind as soon as you gather enough data.

[Figure 2: cost of this third choice compared with the factored and non-factored approaches]

This is a solution that costs less than the factored approach if the problem is small, and costs less than the non-factored solution if the problem is large, while keeping an acceptable overhead when the problem is somewhere in-between.

Of course, writing your own reusable code means that the cost of switching from the non-factored to the factored version is significantly lower than writing the factored version from scratch, because you refactor the original solution into a reusable one.

The advantages are not so obvious when moving from one approach to the other involves throwing away all code and installing a third party application. You do get some benefits (at the very least, you know more about the problem than you did at first, and perhaps your first approach served as a useful prototype to further refine your needs), but doing this can hurt a lot.

So, you end up getting hurt if you don’t know what you’re doing. What a surprise.

Filling in the Holes

As the technical lead on a software project, I get to interact on a daily basis with stakeholders that are technologically impaired. They think in high-level, end user terms like « I need a comment system » or « send the user an e-mail notification » and they expect things to happen without having to delve into the boring techno-babblish details of how it’s done. The rationale is that deciding what happens is a stakeholder job and deciding how it happens is a developer job.

Of course, no matter how hard you try to separate the two, developers are sometimes going to decide what happens, because no stakeholder can spare the time needed to walk the development team through the bloody details of every single feature. Given the productivity gains from recent advances in development tools and the social skills of the average programmer, I think it’s fair to say that an in-depth description of a feature takes about as long as the implementation of that same feature.

This is why all projects follow the same steps regardless of the methodology used:

  1. A stakeholder makes some general statement about a feature, such as user comments being available on certain items.
  2. The development team writes what they feel is the best implementation of that feature in the context of that project, filling in the missing details as they go.
  3. The stakeholder sees the results and points out what details did not match their mental model of the requested features.

This introduces several dangers: there’s the budget issue when missing details turn out to be costlier than originally envisioned, and there’s the mismatch issue that makes the customer unhappy. A good project manager should strive to reduce these. How?

Gathering more requirements is a classic strategy in waterfall models. The basic reasoning is that the more details you manage to gather about the product to be implemented, the lower the chances of a surprise requirement blowing your budget away and the higher the chances of meeting the customer’s expectations. The downside is that this step takes time, which in turn uses up the budget and delays the release.

Also be careful when deciding on a budget after having gathered all the requirements. Everyone changes their mind sooner or later, and any requirement change, no matter how small, should prompt a critical analysis of the budget: adding « just one link » is indeed a tiny change, but the involved overhead (change the internal documentation, determine the impact on other features, tell the developer, write the code, test the changes) adds up much faster than you think.

Fast iterations involving the stakeholders are the Scrum approach, shared with many agile methodologies. It deals with budget issues by developing the simplest possible implementation that matches the requirements. It also provides the customer with feedback on the estimated implementation time through planning poker before every iteration, so that requirements can be changed on the fly if sacrificing a small feature can significantly reduce costs.

Short-iteration agile projects build customer dissatisfaction into the development process: if you don’t like the existing implementation, you can ask for a change and get it done on the next iteration. It also lets the developers decide based on technical considerations (what’s easier to implement) as opposed to high-level decisions (how the stakeholder wants a feature to behave), which lets them work faster and do what they are skilled at.

The downside to these approaches is that stakeholder involvement should not be taken for granted, and even when stakeholders are involved, it’s not uncommon for customers to have dissenting opinions among themselves. Also, an agile process does not help if there’s a fixed deadline and a fixed set of 1.0 features, and the customer expects these to be done on time.

Having developers with common sense helps a lot: all the people working on the project should be able to tell ahead of time if a given solution is going to be unacceptable, and dismiss it if they see one. This avoids implementing useless solutions, or forwarding a useless solution to a customer for validation.

The obvious corollary is that a developer should write bug-free code without having to be told to write bug-free code. Commits that contain segmentation faults, access violations, unhandled exceptions, blank pages, broken links or performance bottlenecks should be investigated, to determine why a mistake was made and what steps should be taken to avoid repeating it.

What other techniques have you come up with for reducing communication-related risks?

Dashes vs Underscores

When you optimize your website for search engines, you have to take every little facet into account. Every character in a URL is a weapon for getting a better ranking than your competitors.

Which leads to quite silly bikeshed conversations.

I have heard that when part of a URL, foo-bar is considered by Google to be a single word, while foo_bar is considered to be two words. I have also heard that foo-bar is treated as two words and foo_bar is treated as one. And I have also heard that both foo-bar and foo_bar are treated equally as two words. The variety of dates on the available resources (anywhere from 2001 to 2009) makes it even harder, as I suspect Google has been evolving its algorithms on the subject over the last eight years.

Ironically, a search for “dashes vs underscores” reveals (in the top five ranks) websites with either underscores or dashes as separators, further adding to the confusion. What is true (and easily verified) is that when part of a search query, foo-bar is treated as two words and foo_bar is treated as one word.

It’s important to notice, however, that search engines don’t exist in a vacuum. They have to take into account whatever is the most prevalent way of presenting information. And it appears, from the many websites that use the “dashes” convention, that the “dashes as two words, underscores as one word” side of the debate has won. WordPress? Dashes. Magento? Dashes. Amazon? Dashes. Google’s own Blogger? Dashes.

So, even if the “dashes as two words, underscores as one word” side was wrong to begin with, it has become so prevalent today that it would be foolish for Google not to change their algorithm in the face of such unambiguous adoption of a word separation convention.
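The “dashes as two words, underscores as one word” convention also happens to match what a naive tokenizer does. The sketch below is in no way Google’s algorithm, which is not public; it only shows how a plain regular-expression word splitter behaves in Python, where `\w` includes the underscore. The `words` helper is a made-up name.

```python
import re

WORD = re.compile(r"\w+")  # \w matches letters, digits and the underscore


def words(text):
    """Split text into words the way a naive \\w-based tokenizer would."""
    return WORD.findall(text)


print(words("foo-bar"))  # ['foo', 'bar']
print(words("foo_bar"))  # ['foo_bar']
```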

Besides, underscores look ugly :)

Information Flow

The real world is a complex place. When writing software that has to interact with the real world, there are literally thousands of concepts you have to master and tens of thousands of details you have to be aware of, or you will paint yourself into a corner where your software clashes with reality. And reality always wins.

Understanding concepts and details is a fundamental part of a project’s time budget, whether they come from the project requirements, real-world constraints, third party code or teammates. Every time information goes around in a project, it uses up valuable time, and to keep the time budget tight it becomes necessary to decide what information should be allowed to go around, and where.

Working on concurrent systems is an enlightening experience, because of the many similarities between an array of computers and a team of information workers. Computer arrays have latency issues when one thread depends on another thread to be done…

“When do you think your settings import module will be done? I’m stuck on the payment API until I can load those settings!”

…they have bandwidth issues and manipulating some data yourself is usually faster than sending the data to another part of the cluster for treatment…

“The User object? Well, it’s a bit of a weird design, but it’s rather clever. I’ll draw you a quick UML sketch on the blackboard so you can see what the five helper classes do.”

…they have to avoid data loss if a computer or network is down…

“I have no idea how this stored procedure works, you should ask Tim, he’s the one who wrote it. He’s in southern France right now but I think he’ll be back next month.”

… and they have to handle a directory of parts and a garbage collector for data…

“Wait, nobody’s written the comment moderation back-office! Who was in charge of doing it? Who wrote the comments front-end anyway?”

There are algorithms, strategies and techniques for handling and optimizing those things. Many of these can be adapted to humans, with the added benefit that, humans being smart, they can understand the point of those algorithms and compensate for minor flaws if the plan isn’t perfect.

Do You Care?

As I mentioned earlier, I use different e-mail addresses for every website that asks me for one. These look like victor-{website}@nicollet.net and are all redirected to the same inbox until I decide I get too much spam from them. In other news, I recently gave one such address to The Motley Fool (a financial information website) and it predictably ended up being the number one source of spam in my inbox. Get cancer and die, Fool.

Non-technical people have asked me whether such an address (namely, one that contains a hyphen) is valid. The answer is that of course, a dash is a valid character in an address (just like _, + and $ for instance) and therefore every sane MTA around the globe should be able to deliver things to my address.

Apparently, Yahoo! does not agree:

Darn you, Yahoo!, now I have to reconfigure the internet.

So, what just happened here? Yahoo! does not want me to enter an invalid alternate e-mail and therefore sets up an invalid e-mail detector. And a false positive happens.

I hate false positives. Being allergic to some kinds of pollen, I have experienced the devastating effects of false positives in my own immune system. Someone (or something) is trying to be smart, but they are not, and it happens in a way that is obvious and frustrating. That this verification is utterly useless only adds more to the frustration.

What is Yahoo! trying to do here? I can see three possible explanations :

Trying to be smart

Maybe a pointy-haired boss thought “everyone validates fields” and asked for all fields to be validated even when it wasn’t necessary. Maybe a developer thought “validating all fields is a clever challenge”. Maybe the underlying libraries include a “mail verification” routine that was programmed by an intern. Either way, the bottom line when you have an opportunity to be smart is that you had better be really smart, or you’ll end up hurting yourself. There is no such thing as “pretty clever” when your code has to serve millions of people.

Making sure every account has a valid e-mail

Nobody trusts free e-mail in the business world. Posting anything even remotely related to business from a Hotmail or Yahoo address screams “amateur” unless you’re in an industry where merely having an address is unusual. The exception here would be Gmail, which merely screams “my company can’t afford a domain”, but then again all our base is belong to Google.

So it should be no surprise that providers of free e-mail would require at least some reassurance that the person creating the account is real, for instance by checking that this person already has an e-mail address (never mind the possibility of confirming account A with account B and vice versa, leaving no trace of my actual identity).

But a mere syntactic verification is useless. I could write mickey1@mouse.com and then increment the “1” until I ended up with a unique address that the system would accept. All you have done is delay the evil scammer for a few minutes, and the scammer doesn’t care because that’s just what his job is. In the meantime, you got the syntax check wrong and hindered legitimate users who have better things to do with their time than change their e-mail address so that they can get a Yahoo! account.

To weed out scammers and invalid addresses, it is necessary to send an e-mail to that address and have the user click on a confirm link. That is the one and only way to tell if an e-mail address is valid.

But once you start doing this level of verification, any other verification suddenly becomes quite useless: you already have 0% false positives and 0% false negatives, so adding another test can only increase the probability of a false positive, with no other benefit. Just accept the address as-is and start the verification workflow.
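The confirm-link workflow described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `EmailConfirmation` class, the `send_mail` hook, the in-memory storage and the example.invalid link are all hypothetical, and a real system would also expire tokens and persist them.

```python
import secrets


class EmailConfirmation:
    """Accept any address as typed; trust it only once its owner clicks the link."""

    def __init__(self, send_mail):
        self.send_mail = send_mail  # hypothetical hook: callable(address, body)
        self.pending = {}           # token -> address awaiting confirmation
        self.confirmed = set()

    def start(self, address):
        """No syntax check at all: mail a one-time token to the address."""
        token = secrets.token_urlsafe(32)
        self.pending[token] = address
        self.send_mail(
            address,
            f"Confirm your address: https://example.invalid/confirm?token={token}",
        )

    def confirm(self, token):
        """Mark the address as valid only if the token came back from the mailbox."""
        address = self.pending.pop(token, None)
        if address is None:
            return False
        self.confirmed.add(address)
        return True
```

An address that cannot receive mail simply never gets confirmed, which is the only property the syntax check was trying to approximate.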

Making sure the user did not mistype their e-mail

I tend to read lists of e-mail addresses as part of my job, and the typical foobar@qux;com is a staple of French keyboards (‘.’ is ‘shift’ + ‘;’). Needless to say, if a user mistypes their password recovery e-mail, they’re in for a world of pain.

However, the correct approach to this issue is to provide a helpful warning, not an error message. Not only do you eliminate the risk of false positives in your regular expressions ever hurting a user’s experience (like mine), but you can also afford to deliberately introduce false positives that correspond to common mistakes but are not necessarily mistakes, thus making the feature even more helpful.

Instead of a nasty “Invalid E-Mail Address” message that begs the question “Who are you to decide that my e-mail address, hosted on my e-mail server and my domain, is invalid?”, a simple “You may have mistyped your address” warning that does not prevent submitting the form would be most welcome.
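As a rough sketch of the warning-not-error idea, the Python function below returns hints and never rejects anything. The function name and the specific heuristics are assumptions chosen for illustration; they deliberately allow false positives, since a warning does not block the form.

```python
def email_warnings(address):
    """Return human-readable hints about likely typos; never reject the address."""
    warnings = []
    local, at, domain = address.rpartition("@")
    if not at or not local or not domain:
        warnings.append("This does not look like an e-mail address. Did you forget the '@'?")
        return warnings
    if ";" in domain or "," in domain:
        # Typical French-keyboard slip: '.' is shift + ';'
        warnings.append("The domain contains ';' or ','. Did you mean '.'?")
    elif "." not in domain:
        warnings.append("The domain has no '.'. You may have mistyped your address.")
    return warnings


print(email_warnings("foobar@qux;com"))
print(email_warnings("victor-example@example.org"))  # []
```

The caller displays whatever comes back next to the field and submits the form anyway.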

I can still remember the good old days when my computer asked me “Are You Sure?” whenever I tried to do something smart. Now, it just tells me “You Can’t Do That”, without the HAL 9000 voice.
Don’t believe me? Think how many lines of code you need to kill the operating system now, versus how many you needed in the good old days—the worst I managed was outside-allocated-memory access with CUDA.

I would argue that enterprise workflow systems push the “You Can’t Do That” logic to its final conclusion: anything out of the ordinary needs moderator intervention (if it is possible at all). This is both harder to program (as you have to clearly express what is ordinary) and harder to use in a pinch where something unusual must be done for the greater good. By contrast, a few permissive systems do exist : if what you’re trying to do can be undone then you are always allowed to do it, and a moderator is then notified about it and may choose to reverse your operation. Of course, some things cannot be undone (viewing or showing restricted information to someone, sending an e-mail to someone, and so on) and therefore require ex ante approval, but most tasks in a computer system are reversible.

Once you taste the pleasure of a “do first, be moderated later” system, it’s hard to go back to “your post will be online once it’s moderated”. Think about what Wikipedia would look like if it applied ex ante moderation…
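A “do first, be moderated later” system could look roughly like the Python sketch below. The `Action` and `PermissiveSystem` names and the two queues are assumptions made for illustration; the only rule taken from the text is that reversible actions run immediately and irreversible ones wait for approval.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Action:
    description: str
    do: Callable[[], None]
    undo: Optional[Callable[[], None]] = None  # None means it cannot be reversed


@dataclass
class PermissiveSystem:
    review_queue: List[Action] = field(default_factory=list)    # already done
    approval_queue: List[Action] = field(default_factory=list)  # not done yet

    def submit(self, action):
        """Run reversible actions at once; hold irreversible ones for approval."""
        if action.undo is not None:
            action.do()
            self.review_queue.append(action)  # the moderator is notified afterwards
            return "done"
        self.approval_queue.append(action)
        return "pending approval"

    def reverse(self, action):
        """Moderator decision: undo an action that was already applied."""
        self.review_queue.remove(action)
        action.undo()

    def approve(self, action):
        """Moderator decision: let an irreversible action go ahead."""
        self.approval_queue.remove(action)
        action.do()
```

The hard part in practice is writing an honest `undo` for each action, not the queueing.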

So, unless you’re facing a critical situation, always give your users the benefit of the doubt and perhaps a warning…

Engrish

It’s virtually impossible to visit every page of a website (well, except for very small websites). And until you try to visit a page or use a feature, you don’t know whether that feature works or not—that’s why testing software in general is so hard. You can’t really know if you’re visiting a cardboard town until you’ve visited everything.

For instance, there’s an MP3 download website called gomar-krakow.com (I’m not making the link clickable) that looks quite professional. That is, until you read the privacy policy :

The Way, what We Use This Information:

We use return email addresses to answer email, what we become. Email Addresses and other given did not collect the wide-spread third party. The Mask of the methods, using to guarantee that your email address is not displayed in clear text within pages of our site.

The School Music Forte is used to totalize, anonymous given for prepare to efficiency website and Music Forte Schools, marketed measures. The School Music Forte does not use such anonymous information for any other integer.

We never use or spread personally identifiable information provided us online in not having relations fetter on this described above.

Our Obligation In Safe Data:

To prevent the unauthorized access, support the accuracy data and guarantee correct use in information, he has reaching procedures to protect and provide information, what we collect online.

As that Address on Us:

If You have other questions or enxiety about this policy of secrecy, please send email to.

The Notice of Change:

The School Music Forte can modify this Politician Secrecy without notice anytime.

The philosophical implications of their poetry are quite moving. And the terms of use page is even worse:

Notice: Trying to get property of non-object in /usr/home/gomar/domains/domain/public_html/templates/mp3-archive/static.tpl on line 3

Notice: Trying to get property of non-object in /usr/home/gomar/domains/domain/public_html/application/modules/content/controllers/IndexController.php on line 46

Notice: Trying to get property of non-object in /usr/home/gomar/domains/domain/public_html/application/modules/content/controllers/IndexController.php on line 47

Notice: Trying to get property of non-object in /usr/home/gomar/domains/domain/public_html/application/modules/content/controllers/IndexController.php on line 47

It seems pretty obvious that those nice shiny “Terms of use” and “Privacy policy” links on every single page are there to provide the illusion of a professional website, even if clicking on them dispels that illusion quite fast.

The same happens when you try to download a product, as a captcha comes up and never registers your input. The whole deal looks suspiciously like a way to extract captcha recognition from clueless humans without anything in return.

As a web designer, recognize that humans are intrinsically shallow, and make sure that you provide the impression of a complete and professional website—just packing functionality together isn’t enough, you have to make it look complete.

As a user, realize that most people who have something to sell already know that cognitive flaw of yours and exploit it to the fullest, and make sure to explore everything just a little bit further to look for inconsistencies or missing pieces before you commit money to a product.

The Law

Many online communities follow an a posteriori moderation scheme to eliminate unwanted content. This improves participation, because anything is published immediately without prior authorization and therefore motivates contributors to contribute more. Besides, it also reduces the cost of moderation by letting the users themselves tag unwanted content for removal.

Of course, things can get nasty once a team decides to post a heap of porn videos to YouTube (read the CNET article): an a posteriori moderation scheme involves keeping any content, no matter how controversial, online for everyone to view for at least a short while.

And things get even uglier when the community wants the content to stay online, but external forces move in to remove it: parents do not want their teenage children to view naughty videos, industry majors do not want their music to be distributed freely on BitTorrent, and your ex-boyfriend doesn’t want you to publish that picture of him picking his nose on Facebook.

The technical question of tagging content for removal is pretty simple: if you have a central authority controlling the content distribution, create an API to mark some content to be reviewed and, if applicable, removed. If there’s no central authority, too bad.
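For the central-authority case, the API can be as small as the Python sketch below. The `ContentRegistry` class, the three statuses and the method names are assumptions made for illustration; note that flagged content stays visible until a moderator decides, which is what a posteriori moderation implies.

```python
from enum import Enum


class Status(Enum):
    ONLINE = "online"
    FLAGGED = "flagged"
    REMOVED = "removed"


class ContentRegistry:
    """Central authority: tracks every item and who asked for its removal."""

    def __init__(self):
        self.status = {}  # content_id -> Status
        self.flags = {}   # content_id -> list of (reporter, reason)

    def publish(self, content_id):
        self.status[content_id] = Status.ONLINE  # online immediately, no prior review

    def flag(self, content_id, reporter, reason):
        """Mark content for review; return False if there is nothing to flag."""
        if self.status.get(content_id) not in (Status.ONLINE, Status.FLAGGED):
            return False
        self.flags.setdefault(content_id, []).append((reporter, reason))
        self.status[content_id] = Status.FLAGGED
        return True

    def review(self, content_id, remove):
        """Moderator decision on a flagged item."""
        self.status[content_id] = Status.REMOVED if remove else Status.ONLINE
```

Keeping the reporter and reason with each flag costs little and leaves a trail of who requested what.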

The legal question, however, is somewhat more complex, and revolves around three questions:

  • How should the claimant request the removal of some content in a way that can be used as proof in court?
  • Under what circumstances should the community be forced to withdraw the content?
  • If the issue is taken to court, what charges can be upheld against the community and against the author?

This obviously depends on the country in which the community operates, since different countries are bound to have different laws.

In France, the main law on the subject is the LCEN—Loi pour la Confiance dans l’Economie Numerique, loosely translated as Law for Trust in the Digital Economy. It is quite restrictive.

First, whenever you use a public communication tool (such as a forum, a blog’s comments, or even your e-mail) you have to make your identity known to anyone reading your content. This means you have to provide your first and last name, home address and phone number. The penalty for giving out incorrect or incomplete information is up to one year in jail and a €75000 fine.

If you don’t want your public information to be freely available to everyone, you can provide all of that information to a service provider responsible for the technical infrastructure of your communication tool, and then provide the identity of that provider instead of yours. This is what happens when you send mail using a mail provider based in France: the provider knows who you are, and a court can find out who you are by tracking the e-mail back to its source and asking the provider.

So, what happens when you post a comment to my blog? Since I am based in France and am responsible for the technical infrastructure for publishing your comments, there are only two possibilities as far as the LCEN is concerned: either I am a mere hosting service, or I am the editor of your comment. The problem is that, as a hosting service, I am not allowed to edit the comments in any way, including for instance cropping out spam comments: everything published on the blog has to be posted as-is without my intervention.

Besides, as a hosting service, I am also required by law to ask you for your name, address and phone number, and I am pretty certain that you’re not going to give them to me. And even if you did, keeping that kind of information around means I have to sign up for a CNIL authorization (which is mandatory for keeping personal information) and must keep that information well-protected under penalty of one year in jail and a €15000 fine, and there’s no way I will risk a year in jail for such a silly reason.

So, I am an editor, and am therefore legally responsible for anything that gets published on my website, including comments from anonymous readers.

Now, do Facebook or YouTube behave as editors or hosting services with respect to French law? Not that it matters: the LCEN is only applicable when you actively set up shop in France to distribute content, not if you distribute content from outside the country.