← Jotting Down | Groups and Users →
So far, we have not paid much attention to what we output back to the user. Three things in particular have been neglected:
- There’s no reason to believe that the HTML output by the system is indeed, as advertised, XHTML 1.0 Strict. Does it matter? I tend to look at XHTML validation in the same way I look at the warnings of my C++ compiler—most warnings that crop up in code are not really dangerous, precisely because the compiler cannot determine for sure that they are and thus throws around a lot of false positives, but if your warning output contains a thousand lines you will not be able to notice the one line that’s actually relevant and dangerous and should be corrected. By giving up on correcting warnings as you go, you give up on the ability of your compiler to identify potential pitfalls for you. And, in exactly the same way, I tend to consider that giving up on validation (because many things that prevent you from validating don’t really harm the project) makes it impossible to use a validator to detect that one error that causes trouble.
- There’s a lot of code flying around in the views that escapes the arguments using htmlspecialchars—the point is to prevent HTML injection attacks by making sure that every piece of data ever output as HTML is either sanitized user input or hand-checked HTML markup. However, some arguments in the render functions are assumed to be unsanitized data and are thus escaped, and other arguments are assumed to be sanitized data and thus kept as-is without escaping. A striking example would be some of the errors containing links (the “retrieve password” link in the “user already exists” signup error for instance). While you can reasonably trust the programmer to make sure every piece of unsanitized text is sanitized before being turned into HTML, you cannot expect them to guess on their own which arguments are going to be turned into HTML and which are expected to already be HTML.
- The views contain hard-coded English text. The longer we spend writing code with hard-coded English text, the harder it will be to extract all that text from the files. Conversely, if we start planning for localization early on, there are techniques that add almost no programming overhead in the code yet support multiple languages freely.
Let’s tackle this in order. Solving the validation issue is not insanely hard: most obvious errors come from mismatched tags, and most non-obvious errors come from using non-standard attributes or from mixing up block elements, inline elements and their relative positions within the tree. A quick visit to the W3C validator is usually enough to get a clean list of these and correct them on the fly. This is easy to enforce if your views are set up correctly: always close a tag in the same PHP scope you open it in (so no implode("</li><li>",$array) please!), and always keep track of the kind of element every function outputs (I assume block by default, inline when specified, which means that when I forget to specify an element is inline I don’t risk anything).
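As a minimal sketch of the "close a tag in the same scope" rule, here is a list renderer that avoids the implode trick. The function name render_list is hypothetical and not part of the project's views; it only illustrates the discipline.

```php
<?php
// Hypothetical helper (not from the project): renders a block-level <ul>
// from a list of plain strings. Every tag is opened and closed in the same
// scope, so no input can leave a dangling or mismatched <li> behind.
function render_list(array $items)
{
    if (count($items) === 0) {
        return ''; // XHTML 1.0 Strict requires at least one <li> inside a <ul>
    }
    $html = '<ul>';
    foreach ($items as $item) {
        // <li> opens and closes within a single iteration
        $html .= '<li>' . htmlspecialchars((string) $item) . '</li>';
    }
    $html .= '</ul>';
    return $html;
}

echo render_list(array('Apples', 'Pears & plums')), "\n";
// <ul><li>Apples</li><li>Pears &amp; plums</li></ul>
```

With implode("</li><li>", $array), an empty array still needs the surrounding tags written somewhere else, which is exactly where mismatches tend to creep in.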
Solving the HTML sanitization issue is a tad harder. It’s impossible to rely on the human programmer for this. I always feel a little queasy when I feed an argument directly into a process that assumes it’s valid: what if it isn’t? What happens if some developer, perhaps even myself, forgets about the assumption and passes an invalid argument three months from now, a cracker finds the vulnerability six months from now, and I have to apply a patch everywhere?
On the other hand, PHP isn’t compiled, so I cannot rely on compile-time type safety like I could have in, say, C++ or Objective Caml. The only kind of safety I have in PHP is type hinting. Since I want to have PHP tell me “this should be sanitized but it isn’t”, I have to create a sanitized HTML data type. Then, whenever a function expects a value to be sanitized (because it is used as-is inside the function to create HTML) it should declare that value as the appropriate type and type hinting will do the rest. If type hinting does not apply (for instance, checking an array of sanitized strings, like our two-column form view) I can always use instanceof.
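For illustration, the type-hinted approach could look roughly like the sketch below. The class name SanitizedHtml and the two render functions are hypothetical stand-ins, and the exact failure mode of a violated type hint depends on the PHP version (a catchable fatal error in PHP 5, a TypeError in PHP 7 and later).

```php
<?php
// Hypothetical marker type: wrapping a string in it is the programmer's
// promise that the contents are already safe to emit as HTML.
class SanitizedHtml
{
    private $_text;

    public function __construct($text) { $this->_text = $text; }

    public function __toString() { return $this->_text; }
}

// The class type hint makes PHP reject anything that is not a SanitizedHtml.
function render_error(SanitizedHtml $error)
{
    return '<p class="error">' . $error . '</p>';
}

// An array hint cannot describe its elements, so each entry is checked
// by hand with instanceof.
function render_errors(array $errors)
{
    $html = '';
    foreach ($errors as $error) {
        if (!($error instanceof SanitizedHtml)) {
            throw new InvalidArgumentException('Expected sanitized HTML');
        }
        $html .= render_error($error);
    }
    return $html;
}

echo render_error(new SanitizedHtml('Oops, <b>try again</b>.')), "\n";
```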
However, that would be thinking like a static-language programmer, which is not a good idea here. It forces callers to manually create a sanitized HTML object whenever they are about to call such a function, which means that handling HTML at any point in your code propagates type hints through a lot of arguments and causes a lot of trouble.
An alternative is to go down the dynamic route. Say you’re generating some HTML. If the argument you received is sanitized, you use it as such. If it isn’t sanitized, you escape it, then use it. By default, data is unsanitized (for obvious safety reasons: you don’t want hell to break loose if you forget to declare a string as unsanitized) so we design a specific type to represent sanitized text. It goes like this:
- You can create a verbatim sanitized text. This will leave any HTML markup as-is. This is what you do whenever you need to create some HTML-decorated data in your code.
- You can conditionally escape text: anything but sanitized text is escaped (but sanitized text is assumed to be safe, so it’s kept without being escaped).
- You can force escaping in all cases, even on sanitized text (because you never want HTML markup to appear in an attribute value, for instance).
The implementation is:
<?php // objects/html.php

class HtmlObj
{
    private $_text;

    // Verbatim HTML
    public function __construct($text)
    {
        if (func_num_args() == 1) {
            $this->_text = $text;
        } else {
            $args = func_get_args();
            array_shift($args);
            foreach ($args as & $arg) $arg = self::Escape($arg);
            $this->_text = vsprintf($text, $args);
        }
    }

    // Escape HTML (if it isn't already)
    public static function Escape($text)
    {
        if ($text instanceof HtmlObj) return $text;
        return new HtmlObj(htmlspecialchars("$text"));
    }

    // Always escape HTML
    public static function Force($text)
    {
        return new HtmlObj(htmlspecialchars("$text"));
    }

    public function __toString()
    {
        return $this->_text;
    }
}
For most purposes, such an object will behave like a string (because it is automatically converted to its value in a string context), but the escape function will play its little magic. The basic idea is this:
You can escape all data across your entire system, in every place that outputs the contents of a variable, without ever asking yourself a single question. It’s that easy. All you have to do, when you need to define some HTML to be displayed later, is wrap that HTML with the appropriate object, and it will appear as such when it’s displayed. You can even store it in the session transparently, and serialize it in the database.
A few practical examples of what we’re doing here. Our reset-password controller uses a reset-password view to display an error. With our new design, the relevant part of this view is written as follows:
if ($error != '') { ?><p class="error"><?=HtmlObj::Escape($error)?></p><?php }
That’s it: no cleverness, no asking yourself whether the error may contain HTML markup that you may want to display. Besides, the error does, in some cases, contain HTML:
if (AuthenticationModel::isMailAvailable($mail)) {
    $url = DomainConfig::Url('/login');
    on_error(array('next' => $get['next'], 'for' => $mail),
             new HtmlObj('Unknown e-mail %s. <a href="%s">Register</a>?', $mail, $url));
}
The HTML object infrastructure guarantees that if I mark some text as containing HTML, it will be rendered correctly across the system, no questions asked, and it interacts beautifully with the session too. Note that the printf-like approach used by the constructor will escape any argument but the first, so that even if the user entered a mail that contains HTML, it will be escaped.
So, as long as you don’t forget to escape all variables right before inserting them into an HTML stream, you eliminate all risks of script injection. Of course, if you forget to mark some HTML as being safe (by using the constructor) it will be escaped by the system, but this is not a security vulnerability, merely a cosmetic issue that you can correct on your own.
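To make the three behaviours concrete, here is a small sketch. It repeats the HtmlObj class in condensed form only so that the snippet runs on its own; the sample strings are made up, and the expected outputs in the comments assume nothing beyond the standard behaviour of htmlspecialchars on the characters &, < and >.

```php
<?php
// Condensed copy of the HtmlObj class, repeated so this sketch runs alone.
class HtmlObj
{
    private $_text;

    public function __construct($text)
    {
        $args = func_get_args();
        array_shift($args);
        if (count($args) > 0) {
            // printf-style: escape every argument except the format string
            $args = array_map(array('HtmlObj', 'Escape'), $args);
            $text = vsprintf($text, $args);
        }
        $this->_text = $text;
    }

    public static function Escape($text)
    {
        return ($text instanceof HtmlObj) ? $text : self::Force($text);
    }

    public static function Force($text)
    {
        return new HtmlObj(htmlspecialchars("$text"));
    }

    public function __toString() { return $this->_text; }
}

$plain  = 'Tom & <Jerry>';                     // untrusted input
$markup = new HtmlObj('<em>%s</em>', $plain);  // trusted markup, escaped argument
$bold   = new HtmlObj('<b>bold</b>');          // verbatim markup

echo HtmlObj::Escape($plain), "\n";   // Tom &amp; &lt;Jerry&gt;
echo HtmlObj::Escape($markup), "\n";  // <em>Tom &amp; &lt;Jerry&gt;</em>
echo HtmlObj::Force($bold), "\n";     // &lt;b&gt;bold&lt;/b&gt;

// The object survives serialization, so it can sit in a session or a database
// and still be recognized as sanitized afterwards.
$restored = unserialize(serialize($markup));
echo HtmlObj::Escape($restored), "\n"; // <em>Tom &amp; &lt;Jerry&gt;</em>
```

Calling Escape twice on the same value is harmless, which is what makes it safe to escape "everywhere" without thinking about it.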
Issue solved. The next problem is localization. It’s usually split into two distinct concepts:
- Letting developers enter text in an arbitrary language at any position in your system. This means the HTML, the error messages, the data displayed by JavaScript, and so on. Mostly, it means being able to translate the text.
- Accepting, and handling, the fact that not all languages on earth are left-to-right languages. Arabic and Hebrew, for instance, are written right-to-left, which means any Latin-centric left-to-right reflexes must be avoided.
Handling text direction is a complex issue that I will not be dealing with: first, I am not fluent in any RTL language, which makes me a bad judge of whether something is correct or not. Second, any content-driven website will inevitably end up having both RTL user-entered text in an LTR context and LTR user-entered text in an RTL context, which is a difficulty I do not wish to handle even though I expect the Unicode Bidirectional rules to manage that reasonably.
This leaves us with the first issue: localizing arbitrary pieces of text on the website. Since by default the text is present as string literals within the application (mostly inside views and controllers), the localization process has to be decided early on, or the search-and-replace throughout the entire application will be too difficult. My strategy is the following: wherever there is text to be displayed, the programmer is provided with a specific variable called $_, which happens to be a function that returns the translation of its argument in the current locale. So my average view function turns into:
public static function RenderInitialContent($_, $mail, $error, $next)
{
    if ($error != '') {
        ?><p class="error"><?=HtmlObj::Escape($error)?></p><?php
    }
    ?><p><strong><?php
    echo $_('To retrieve your account password, please enter ' .
            'your account e-mail address below.');
    ?></strong></p><p><?php
    echo $_('You will receive a confirmation e-mail at that ' .
            'address containing a link. Follow that link to ' .
            'change your account password.');
    ?></p><?php
    $c = new TwoColFormView();
    $fields = array($c->Input("mail", $_("E-mail"), $mail),
                    $c->Submit($_("Send e-mail")));
    $target = DomainConfig::Url("/do-reset-password", array('next' => $next));
    $c->Render($target, 'mail-form', $fields);
}
If you’ve already done PHP localization with the gettext extension, you might be wondering why I’m using “$_” instead of the classic “_” alias of the gettext function. The explanation is actually that my definition of current locale is slightly different than that of gettext. There’s no reason why the current locale inside a view function should be the same as the global locale for the current request: imagine an English-speaking user causing the system to send a mail to a French-speaking user, and you’ll see that the function that renders the mail has to be able to use a distinct locale from the global one.
Consider now that any piece of HTML might potentially be included in an HTML report sent by mail to several users with different language settings, and you’ll start noticing why I’m doing this.
As a whole, it does not take a lot of additional work: every single view function takes the translation function as its first argument, and I’ve added a line to the pervasive file (included before every script run) that initializes the global variable “$_” with a function that returns its argument as-is. When more languages become available, I will change this function so that it depends on the user settings instead.
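One possible shape for such a translation function is sketched below. Everything here is an assumption made for illustration: make_translator, render_greeting and the array-based catalog are hypothetical names, and the closures require PHP 5.3 or later. An empty catalog gives the "return the argument as-is" behaviour.

```php
<?php
// Hypothetical factory: builds a translation function for one locale from a
// plain lookup table. Strings missing from the table fall through unchanged.
function make_translator(array $catalog)
{
    return function ($text) use ($catalog) {
        return isset($catalog[$text]) ? $catalog[$text] : $text;
    };
}

// Hypothetical view function: the translator comes first, so the caller
// decides which locale this particular piece of HTML is rendered in.
function render_greeting($_, $name)
{
    return '<p>' . $_('Hello') . ', ' . htmlspecialchars($name) . '</p>';
}

$identity = make_translator(array());                     // text as-is
$french   = make_translator(array('Hello' => 'Bonjour')); // one French entry

echo render_greeting($identity, 'Ann'), "\n"; // <p>Hello, Ann</p>
echo render_greeting($french, 'Luc'), "\n";   // <p>Bonjour, Luc</p>
```

Because the locale travels as an argument rather than as global state, the same request can render one mail in English and the next in French.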
Hi. I'm Victor Nicollet,
0 Responses to “15. Localization”