Tag Archive for 'The Blog'

Smart Spamming

I found an interesting comment on my website today, on the article about last-minute skinning of an HTML page from JavaScript. It looks pretty sane:

CT — October 5, 2009 at 22:15

Interesting stuff. I don’t relish the idea of taking the vile HTML our designers produce and creating the skin files. Nice proof of concept though – I’ll have to keep an eye out for an excuse to use it ; )

This comment, while completely adequate and relevant to the article, is spam. How do I know? First, the provided website is a classic credit-rating-improvement web portal. But should I prevent people who work in the credit spam industry from posting relevant comments on my articles? Well, there are other comments on that article, too, such as:

Tom Milsom — September 8, 2009 at 11:41

Interesting stuff. I don’t relish the idea of taking the vile HTML our designers produce and creating the skin files. Nice proof of concept though – I’ll have to keep an eye out for an excuse to use it ; )

So, it looks like the spam-bot found an earlier comment on the article, copied it verbatim, and posted it with a different link. This would ensure that, if the spam domain is fresh enough not to register as such, the Akismet spam detector would let the comment go through unscathed based on its content alone. And as a human, if I did not pay attention to the author’s website while reviewing comments, I would let it go through as well because the comment would look sane. I don’t remember comments from one month ago, and I guess many people don’t.
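A minimal sketch of how this particular trick could be caught automatically: flag any comment whose body matches an earlier comment while pointing to a different website. The dictionary keys (author, url, body) and the normalization rule are assumptions made for illustration; this is not how Akismet or WordPress work internally.

```python
import re


def normalize(text):
    """Lowercase and collapse punctuation and whitespace so trivial edits do not hide a copy."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def find_copied_comments(comments):
    """Return (earlier_author, later_author) pairs with the same body but different links.

    `comments` is a chronological list of dicts with the hypothetical keys
    "author", "url" and "body".
    """
    first_seen = {}
    suspicious = []
    for comment in comments:
        key = normalize(comment["body"])
        earlier = first_seen.get(key)
        if earlier is None:
            # First time this text appears: remember who wrote it.
            first_seen[key] = comment
        elif earlier["url"] != comment["url"]:
            # Same text, different website: likely a copy-paste spam comment.
            suspicious.append((earlier["author"], comment["author"]))
    return suspicious
```

Such a check only catches verbatim or near-verbatim copies; a bot that paraphrases would slip through.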

Everyone enjoys advertising if they are looking for, or otherwise interested in, the product being advertised. I discovered Cushy CMS because it ran an ad on The Daily WTF, and I am quite happy with the discovery because I was looking for such a product. And nobody enjoys advertising for products they don’t need—I don’t give a cheese about US credit ratings. I have limited space on my screen that I’d rather not fill up with advertising about things I do not need, and my time is even more precious than that.

This spam comment blurs the line between spam comments that are irrelevant to the discussion and point to websites irrelevant to the readers, and ham comments that are relevant to the discussion and point to websites that are relevant to the readers (by virtue of usually being run by the author of the comment and thus sharing at least some elements).

Suppose that tomorrow, someone posts an original and interesting comment on one of my articles, yet links it to a credit rating website. Should I accept the comment as such, block it, or publish it without the link?

One of the main reasons why people comment on the blogs of other people is to improve their visibility on the internet. If I post a comment on a well-known blog, hundreds and thousands of people will browse over that comment, a small percentage of these will find my writing worthy enough to follow the link and end up on my blog, and an even smaller percentage will become regulars, posting comments and subscribing to my feeds. Which is good, of course, because the more comments I get on my blog, the more interesting it becomes.

This means that commenting is often quite similar to advertising one’s own blog or website. People allow commercial advertising on their blogs (ad banners and such) to get money in return, and they allow personal blog/website advertising on their blogs to get comments in return. So, I guess if an irrelevant website was linked to by a genuinely interesting comment, I would publish that comment (of course, restrictions do apply: I would not allow all websites, just like I would not allow all ad banners).

I like the blogs with good comment advertising—where I can browse the comments and find links to interesting websites.

Tangane Blog

My employer, Tangane, has recently opened a corporate blog [fr]. I will be one of the editors, so expect to find articles from me regularly—the target audience is not as technical as the one for my own blog, so the posts will probably be more about general considerations and strategy in IT.

Sleep(1)

Not a lot going on right now: most of my free time is dedicated to a secret project. In the meantime, a few old yet useful links I had lying around:

Blog Refactoring

The old categories of “Functional Tuesdays”, “Dynamic Wednesdays” and “Imperative Fridays” are getting increasingly restrictive, mostly because they were thought out quite quickly when I first started my blog five months ago. The main problem is that, while I do have things to talk about, quite often it’s a matter of one category being full of ideas and the other empty (with the inspirational category changing over time). This has obvious effects:

  • I don’t have an excellent idea for a column, so I’ll write a short article on an idea that’s good, but not awesome.
  • I can’t stay passionate about a topic for weeks, nor can I write everything in advance and still manage to fill the less passionate columns in time, which means a lot of ideas are just thrown away because they don’t fit my publishing schedule.

Another reason why I’m giving up on these is that WordPress seems to have wiped away all my categories. There are no more categories either on the public website or in the back-office, and I won’t be hunting for them in the code or database anytime soon.

Also, I won’t be changing the regular schedule: Tuesday, Wednesday and Friday remain fixed publishing days, although if I have too many things to say, I will add them anywhere during the week. This way I still have a quantity constraint (or else I won’t get anything done) and the fixed days will prevent me from writing the articles in a single Sunday evening rush.

So, What’s Your Name?

I’ve been having a lot of name-related problems, lately. Not my own name, but the names of various things I have to work with. Nothing unexpected: in the computer world, we tend to use names everywhere.

Every time there are objects to be manipulated by programmers, there’s a way to give names to these objects. It could be the “id” attribute in XML, it could be the name of a file in a filesystem, it could be the name of a variable or member in a program, or it could be the hostname of a web site… A name is, in theory, a set of characters from a predetermined alphabet (that varies based on the application) which is bound to an entity of some sort. Once you hold the name, you can get the associated entity directly.

This theoretical definition leads to two distinct issues:

  • There’s only a small number of acceptable names around. Of course, generic names like “clxbpf8990” can be used (and if you look at the ids generated by generate-id() in XSLT, this is exactly what happens) but the entire point of a name is to allow people to retrieve content based on the name. As far as the internet goes, there’s only a finite number of names any user can remember, and web users have become increasingly reliant on Google and the recently added Firefox 3.0 address bar to find sites with non-obvious names. How do we handle the inevitable collisions?
  • A programmer in Australia has a global variable named Foo in a C program. A programmer in Sweden, working on a different C program, also has a variable named Foo. Should the two collide? What about machine names on local networks? User names on the same machine? Not all names are global, which means that the scope of names has to be determined, and ways of handling collisions between local scopes must be invented.

Yet, it’s not all about collision. The most important element of naming is the ability to find the data bound to a name. This is what directories are for.

The simplest form of directory is the phone directory: you have a name on the left, and a phone number on the right. The directory also happens to be sorted by ascending key order so that dichotomic search can be used to find a given name in logarithmic time (how clever—in the age of search engines, we tend to forget logarithmic search times existed since the days of dictionaries and encyclopedias). Such directories exist in the computer world. The simplest would be the classic user directory, accessed through LDAP (Lightweight Directory Access Protocol) or perhaps ActiveDirectory, but these are not the most frequently encountered by the common user. Another simple example is /etc/passwd:
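To make the logarithmic lookup concrete, here is a small sketch of a sorted phone directory searched with Python's standard bisect module. The names and phone numbers are invented for illustration.

```python
from bisect import bisect_left


def lookup(directory, name):
    """Find `name` in a list of (name, number) tuples sorted by name.

    bisect_left performs a binary (dichotomic) search, so the number of
    comparisons grows with the logarithm of the directory size.
    """
    # (name,) sorts just before any (name, number), so this lands on the entry if present.
    index = bisect_left(directory, (name,))
    if index < len(directory) and directory[index][0] == name:
        return directory[index][1]
    return None


# Hypothetical entries, already sorted by name.
PHONE_BOOK = [
    ("Alice", "555-0100"),
    ("Bob", "555-0101"),
    ("Carol", "555-0102"),
]
```

Calling lookup(PHONE_BOOK, "Bob") returns "555-0101", and a name that is not listed returns None.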

username:!:100:100:office:/home/username:/usr/bin/sh

User name is on the left, various user-related information is to the right.
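As a sketch, the line above can be split into its seven colon-separated fields (name, password placeholder, user id, group id, comment, home directory, shell) and stored in a dictionary keyed by user name. The type and function names below are my own, not part of any system library.

```python
from typing import NamedTuple


class PasswdEntry(NamedTuple):
    name: str
    password: str
    uid: int
    gid: int
    comment: str
    home: str
    shell: str


def parse_passwd_line(line):
    """Split one passwd-style line into its seven colon-separated fields."""
    name, password, uid, gid, comment, home, shell = line.rstrip("\n").split(":")
    return PasswdEntry(name, password, int(uid), int(gid), comment, home, shell)


def build_directory(lines):
    """Map each user name to its entry, so the name acts as the lookup key."""
    entries = (parse_passwd_line(line) for line in lines if line.strip())
    return {entry.name: entry for entry in entries}
```

On a real system, Python's pwd module already provides this lookup; the sketch only shows the name-to-record idea.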

Domain Name System

Let’s crank the complexity lever up a little bit. What classic directory format do we use several times a day? The Domain Name System (which most of you know as DNS). Everyone sends out queries to a DNS server, asking for the IP address associated with a name. That is, when you type “www.nicollet.net” in your browser, the browser needs to know what server to connect to. The problem is that looking for a server is done by routers, which are usually not very smart: they have routing tables based on masking the IP address (whether that address is IPv4 or IPv6) so they need that address to begin with. So, the browser first resolves the domain name to an address. This involves looking at the local “hosts” file to see if a definition exists (for instance, “localhost” tends to be bound to the loopback address “127.0.0.1”), then querying any local name services the operating system may provide (this allows you to connect to an HTTP server running on another machine on your network by using that machine’s name, without having to set up a local DNS server or registering with a public one), and finally sending out a query to a DNS server somewhere on the internet (the address of the DNS server is either provided by the ISP itself, or manually entered by the user in a configuration wizard or file). The DNS server then returns the address for the domain.
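The lookup order described here (hosts file, then local name services, then a DNS server) can be sketched as a chain of sources tried in turn. Everything below is a simplified model: the dictionaries stand in for the real hosts file and local name service, dns_query is a hypothetical callback, and the addresses come from ranges reserved for documentation.

```python
def resolve(name, hosts_file, local_names, dns_query):
    """Try each source in order and return (source, address) for the first hit."""
    sources = (
        ("hosts", hosts_file.get),    # static entries such as localhost
        ("local", local_names.get),   # names learned on the local network
        ("dns", dns_query),           # last resort: ask a DNS server
    )
    for source, lookup in sources:
        address = lookup(name)
        if address is not None:
            return source, address
    raise LookupError(f"cannot resolve {name}")


# Hypothetical data for illustration.
HOSTS = {"localhost": "127.0.0.1"}
LOCAL = {"office-pc": "192.0.2.15"}
PUBLIC = {"www.nicollet.net": "203.0.113.7"}
```

With these tables, resolve("localhost", HOSTS, LOCAL, PUBLIC.get) never reaches the DNS step, while "www.nicollet.net" falls through to it.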

This is where two of my recent name problems came from. As you might remember, one week ago I had to move my blog from one server to another, and in the process, I had to change the DNS entry so that readers who typed in “www.nicollet.net” connected to the new server instead of the old one.

The first issue was that DNS propagation is not instantaneous. Back in the days when the intertubes were often clogged, caching played a big role in avoiding too many DNS queries moving up to the reference DNS server for a given top-level domain. The downside is that when you type in “www.nicollet.net”, you’re not really asking the reference DNS server for domains ending in “.net”, you’re asking the DNS server provided by your ISP, which may decide that its copy of the “www.nicollet.net” address binding is correct, without even asking the reference server (this is what caching is for). So, it took about five hours for all visitors to be correctly redirected to the new website. If any of you posted any comments, they went to the old website and were lost along with it—sorry. Of course, as you can expect, this can get a lot more problematic once you have a highly interactive website. So, you tend to choose an IP address and stick with it (or, if you have to change it, then you have it forward everything to the new one until nobody uses it anymore).
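A rough model of why the change took hours to be seen: a caching resolver keeps an answer until its time-to-live expires and only then asks upstream again. The class below is a sketch under that assumption; the five-hour lifetime in the usage note simply mirrors the delay I observed, and the addresses are placeholders.

```python
class CachingResolver:
    """Answers from its cache until the cached entry expires, then asks upstream again."""

    def __init__(self, upstream, ttl_seconds):
        self.upstream = upstream      # callable: name -> address
        self.ttl_seconds = ttl_seconds
        self.cache = {}               # name -> (address, expiry time)

    def resolve(self, name, now):
        cached = self.cache.get(name)
        if cached is not None and now < cached[1]:
            # Still fresh according to the cache, even if the real binding has changed.
            return cached[0]
        address = self.upstream(name)
        self.cache[name] = (address, now + self.ttl_seconds)
        return address
```

If the authoritative binding changes right after a resolver with ttl_seconds=5 * 3600 has cached the old address, that resolver keeps handing out the old address for up to five hours, which is the window in which comments went to the old website.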

This also meant that any e-mails sent to foobar@nicollet.net were routed to the old MX binding, “mx1.ovh.net” (for those who wonder, a DNS entry contains several bindings: a web browser would look at the main binding for that domain, while a mail delivery program would look for the Mail eXchange binding) even though I was expecting them on the new MX binding, “nicollet.net”. This took a short while to be sorted out.

The second issue was with the name of the server. See, being listed in a directory doesn’t change your name: it merely gives you a name by which you can be found. So, the name of the famous emperor of Wei is still Cao Cao, despite being posthumously named Wu. And the name of my new server still was r17474.ovh.net despite being now referred to as nicollet.net by the DNS. Then, of course, when some mail arrived for the user foobar@nicollet.net, my qmail-powered server quite naturally answered “there’s no user named foobar@nicollet.net here, please go away” in perhaps not so polite terms. So, I still had to convince my server that it was now known as “nicollet.net” so that the mail delivery could work. Nothing that the “hostname” UNIX command couldn’t solve.

What about local area networks? How come you can access another machine on your network by using its name? There usually isn’t a central naming service which can be queried by computers on a LAN, and it wouldn’t be practical because it would have to be manually configured on every new computer. Instead, computers on a LAN use NetBIOS to set up local directories: whenever a new computer is hooked up to the network, it broadcasts its hostname and address to everyone else, and everyone else writes down the association in a local directory. Conflicts are resolved quite simply: if two computers have the same name, the last to broadcast is the one that everyone else remembers (this is quite useful if you have to move from an office to another and get a new IP as a consequence).
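The broadcast scheme as described above (not the actual NetBIOS wire protocol, which is more involved) can be modeled in a few lines: every host keeps its own directory and overwrites an entry whenever a newer announcement uses the same name. The class and function names are invented for this sketch.

```python
class LanHost:
    """A machine on the LAN that keeps its own local name directory."""

    def __init__(self, name, address):
        self.name = name
        self.address = address
        self.directory = {}   # hostname -> address, as heard from broadcasts

    def on_announcement(self, name, address):
        # Last announcement wins: a later broadcast overwrites an earlier one.
        self.directory[name] = address


def announce(network, newcomer):
    """Broadcast the newcomer's name and address to every other host."""
    for host in network:
        if host is not newcomer:
            host.on_announcement(newcomer.name, newcomer.address)
```

Announcing the same name twice with different addresses leaves every listener with the second address, which is the "move to another office" case.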

Names in Languages

Programming languages all provide the user with ways of naming entities that are significant, both for documentation purposes (it’s easier to understand what a value is when it has a reasonable name) and for cross-referencing data defined in one place and used in another.

On the one hand, you have the static compile-time approach. This is the easiest to work with by far, because all the nitty-gritty details are written down in the documentation of the compiler and the build scripts of your application. For instance, most compilers allow the definition of “include paths”, a list of locations on the filesystem where the definitions of objects can be found. When the compiler needs something (an included file in C or C++, a class in Java or Actionscript, a module in Objective Caml), it will look for an appropriately named file in all these locations.

Let’s consider the Objective Caml compiler, “ocamlc”. Whenever a source file contains a reference to another module (by using a symbol in a module context, such as “ModuleName.member” or “Functor(ModuleName)” or “open ModuleName” or something like that), the compiler looks for the compiled interface of that module. This is done by looking for the file “moduleName.cmi” (which is generated manually by running “ocamlc moduleName.mli”) in all the locations configured with the compiler: first, the current directory, then paths specified through the environment variables, then include paths specified with the -I command-line argument. If you’re only compiling the module (flag -c), the result is a cmo file (or cmx file with “ocamlopt”). At link-time, you must then specify all the cmo files required by the program, and the compiler resolves all links based on the name of the cmo file (for simplicity, it’s possible to group together several cmo files as a single cma file: they are simply concatenated together, so using the cma library is equivalent to adding all the cmo files it contains manually).
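The search itself boils down to "try each directory in order and stop at the first match". The following sketch shows that idea in Python; it is not the compiler's actual implementation, and the function name and example paths are made up.

```python
from pathlib import Path


def find_in_search_path(filename, search_path):
    """Return the first existing `filename` among the directories of `search_path`, or None."""
    for directory in search_path:
        candidate = Path(directory) / filename
        if candidate.is_file():
            return candidate
    return None
```

A call such as find_in_search_path("moduleName.cmi", [".", "/hypothetical/lib", "extra"]) mirrors the order described above: current directory first, then configured paths, then the -I directories. Because the first match wins, a stale file early in the path can shadow the one you meant to use.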

Java goes a short step beyond this by eliminating the need for a link step. When you import a Java class, using the “import abc.def.Foobar;” statement, Java understands that by looking at the relative path “abc/def/” it will find either a “Foobar.class” or a newer “Foobar.java” that it can compile to “Foobar.class”. Here, the path is specified relative to one of the include paths (which Java calls classpaths) and can be a normal filesystem path or a path within a JAR archive.
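The mapping from a qualified class name to candidate files can be sketched as follows. This only enumerates the paths that would be tried, in classpath order; the real tools also compare timestamps and look inside JAR archives, which this sketch ignores. The function name and the classpath roots in the usage note are illustrative.

```python
def class_candidates(qualified_name, classpath):
    """List the files that could define `qualified_name`, in classpath order."""
    relative = qualified_name.replace(".", "/")   # abc.def.Foobar -> abc/def/Foobar
    return [
        f"{root}/{relative}{extension}"
        for root in classpath
        for extension in (".class", ".java")
    ]
```

For "abc.def.Foobar" and a classpath of ["build", "src"], the first candidate is "build/abc/def/Foobar.class" and the last is "src/abc/def/Foobar.java".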

C and C++ are both the simplest and the most complex. In these languages, every reference is explicit. The first step is compilation, where files are included in other files by specifying the relative path. The second step is linking, where libraries and objects are included by specifying the relative path. At each step, additional locations can be specified to look for headers or libraries.

PHP has an interesting dual approach to things. On the one hand, its normal cross-file interpretation system consists of include() and require() statements which look for the named file in any of the specified include paths for PHP, which makes it look like SH, C and C++. On the other hand, PHP has also introduced functionality for resolving class names: when a class is used (“ClassName::Member”, “Foobar extends ClassName” or “new ClassName”) but the class is not defined, a special function is called. This lets the user specify an alternative loading scheme, such as looking for a file named ClassName.php and including it. The Zend Framework makes heavy use of this, meaning that the inclusion approach is specified once in index.php (or a common configuration file) and then no other inclusion is used for class files (arguably, Zend_View still includes phtml files explicitly).
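The autoloading idea can be modeled outside PHP as a registry that calls a user-supplied hook the first time an unknown class name is requested. The sketch below is a Python analogy, not PHP's API: ClassRegistry and file_for_class are hypothetical names, and the underscore-to-directory rule is my assumption about the kind of naming convention a framework might use.

```python
def file_for_class(class_name):
    """Map a class name to a file path, e.g. Zend_View -> Zend/View.php (assumed convention)."""
    return class_name.replace("_", "/") + ".php"


class ClassRegistry:
    """Looks up classes by name and calls an autoload hook for unknown names."""

    def __init__(self, autoload):
        self._classes = {}
        self._autoload = autoload   # callable(registry, class_name)

    def define(self, name, definition):
        self._classes[name] = definition

    def get(self, name):
        if name not in self._classes:
            # Give the hook one chance to load and define the missing class.
            self._autoload(self, name)
        if name not in self._classes:
            raise NameError(f"class {name} not found")
        return self._classes[name]
```

The hook runs only once per class: after it has defined the class, later lookups hit the registry directly, which is why the inclusion logic only needs to be written in one place.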

On the other hand, there’s runtime access. This is harder to work with, because there’s less documentation available. Sure, some concepts, such as dynamic linking, are fairly well-documented mechanisms that look for the named file in the current directory first, then in other directories specified as environment variables or system-wide configuration elements (in the case of Java, in the classpath, for example). However, even then, some programs insist on working on their own.

I recently had an issue with Alfresco, namely alfresco-mmt.jar which terminated with a ClassDefNotFound exception. Looking for solutions on the internet, I found out that the class it was looking for was defined in a JAR nearby, so I adapted the classpath when running the jar so that the class could be found. Except it couldn’t. The problem with this was that the exception did not specify where the class loader had been looking for the class (or perhaps it did internally, but as an end user with no access to the source code or, probably, a debugger, I couldn’t see it), so I had no way of understanding where the JAR with the class should have been placed. It turns out, the application loaded the JAR from inside its own JAR, so I had to place it there.

Welcome Back

My Christmas gift of 2004, from my parents, was the nicollet.net domain name and some lightweight hosting. The old nicollet.net server (well, actually, a small part of a server) served me well over the years, with over 100k visitors:

Total Visitors 107,104
Total Pageviews 1,318,695
Total Hits 1,650,862
Total Bytes Transferred 14.54GB
Average Visitors Per Day 74
Average Pageviews Per Day 911
Average Hits Per Day 1,141
Average Bytes Transferred Per Day 10.30MB
Average Pageviews Per Visitor 12
Average Hits Per Visitor 15
Average Bytes Per Visitor 145,756
Average Length of Visit 177sec

Not really impressive over such a long period, but until August 2008 the site didn’t contain much beyond a few files I needed on other computers, some pictures and documents I wanted to share, and perhaps small snippets about development that I uploaded once every two months.

Then, in August 2008, the blog started. I had recently uploaded a fresh install of Joomla! for managing my files, and since it more-or-less handled blog-like layouts, I tried to write a blog. That was a mistake, of course: the default install of Joomla! is not well suited to blogging, even more so the 1.5 version, which had a lot less support for external modules and plug-ins. The best free component for comments I could find was Chronocomments, which didn’t work for all my visitors, but did work for many spammers: starting from November, I received more spam on my website than through all my mailboxes combined, totaling around 900 spam comments (and only 8 legitimate comments).

Either way, through some advertising (mainly, by writing useful things on my blog and then linking to my blog from forum threads where those articles would answer the question being asked) I managed to get a reasonable number of visitors (around 60 daily non-bot visitors).

The visitors for December are only counted up to the 22nd (for some reason, the log analysis tool doesn’t want to show more than that).

What now?

As of today (December 30th), I’m moving on to a new server. The old one supported PHP4 and an old version of MySQL, and didn’t have much in the way of freedom (only FTP access was allowed). The new one is a good old dedicated server running Debian Etch 4, on which I have installed Apache2, PHP5 and MySQL 5 myself. I have also moved from Joomla! to WordPress for my blog management software.

The transfer is not yet complete (I spent five hours yesterday converting all my blog articles from the old format to the new one) and I couldn’t transfer the comments. My e-mail address (firstname@nicollet.net) is down for the time being, because I have to install a new mail server myself and I won’t be able to get around to it until the end of the week. I also have to re-upload my teaching material, which will also be done by next week.

The URLs are still a bit wonky (the DNS transfer was completed after WordPress was installed, so the server still calls itself by its old name, r17474.ovh.net) but should be ironed out soon. The URL naming scheme I decided to use in WordPress (nicollet.net/year/month/title) is different from the one in Joomla! (nicollet.net/blog/category/id-title) but I am in the process of setting up the appropriate redirections from the old site to the new one so that no articles are lost.
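As a sketch of what such a redirection has to compute: the old path carries a category and an id, the new one needs a year and a month, so some table mapping article ids to publication dates is required. The regular expression, the function name and the `published` table below are assumptions for illustration, not the rules actually deployed on the server.

```python
import re

# Old scheme: /blog/<category>/<id>-<title>
OLD_URL = re.compile(r"^/blog/(?P<category>[^/]+)/(?P<id>\d+)-(?P<title>[^/]+)/?$")


def redirect_target(old_path, published):
    """Translate an old path into the /year/month/title scheme, or None if unknown.

    `published` is a hypothetical table mapping article id -> (year, month).
    """
    match = OLD_URL.match(old_path)
    if match is None:
        return None
    date = published.get(int(match["id"]))
    if date is None:
        return None
    year, month = date
    return f"/{year}/{month:02d}/{match['title']}"
```

For example, with published = {42: (2008, 12)}, the path "/blog/functional/42-some-title" maps to "/2008/12/some-title", and anything that does not match the old scheme is left alone.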

As for the RSS feeds, however, I’m afraid they’ll be discontinued (partly because of the silly URL structure in Joomla! and partly because the structure of the new blog version itself has somewhat changed, and now includes articles independent of the current day, such as this one). Make sure you scroll to the bottom of the page and get the new global feed (or just click here).

Happy new year, and good reading!