Language Statistics

I recently stumbled upon a study (article) of similarities between languages based on the frequency of two-character sequences within each language. The basic premise is that the words of different languages are written in different ways: the sequence “ch” is more frequent in French, while “sh” is more frequent in English. The probability matrix of all two-character sequences was called the Statistical Language Signature, and was used in that study to build a tree representing the evolution of languages.

Could a similar technique be applied to programming languages? For instance, could an analysis of two-character sequences identify a text as being written in one programming language rather than another?

I have written one such statistical analysis program and extracted the number of occurrences of each sequence in code databases of tens of millions of characters for various languages. These are not probabilities of occurrence, but rather the raw counts, sorted in descending order:

  • C++ (.[ch]pp)
  • Perl (.pl)
  • PHP (.php)
  • OCaml (.ml)
  • LaTeX (.tex)
  • SH (.sh)
  • Assembly (.s)
  • XML (.xml)
  • LISP (.el)

Data was gathered by concatenating together files I could find on my hard drive. As such, there is probably a statistical bias (for instance, related to the fact that some OCaml files were generated by Menhir and not hand-written). Also note that white space is mostly ignored (so languages such as make or Python, which are whitespace-sensitive, might yield odd results). Let’s consider a few interesting lines from, say, OCaml and PHP:

     OCaml (left)                  PHP (right)

     1    1041163 : a-z a-z         1     823427 : a-z a-z
     2      61686 :   _ a-z         2      50294 : A-Z A-Z
     3      55004 : a-z   _         3      35603 :   $ a-z
     4      45553 : A-Z a-z         4      34804 :   -   -
     5      31807 : 0-9 0-9         5      33476 :   =   =
     6      27180 : a-z A-Z         6      31877 : A-Z a-z
                                    7      24959 : a-z   (
     8      20599 : a-z 0-9         8      18658 : a-z A-Z
     9      18215 : A-Z A-Z         9      15675 :   _ a-z
    10      16455 : 0-9 a-z        10      13769 : a-z   _

The first noticeable element is the “$a-z” sequence (rank 3, right) that is particular to PHP (and, to a lesser extent, Perl). Another is the difference between PHP ALLCAPS constants on the one hand (rank 2, right), and the Constructor_name approach used by OCaml (ranks 2, 3, 4 left).
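
For anyone wanting to reproduce counts like these, here is a minimal Python sketch of how such a tally might be computed. It is not the statistics program itself: the names `char_class` and `pair_counts` are made up for illustration, the grouping of letters and digits into a-z, A-Z and 0-9 is inferred from the tables above, and skipping any pair that touches white space is only an assumption about what "mostly ignored" means.

```python
from collections import Counter


def char_class(c):
    """Collapse ASCII letters and digits into classes; keep anything else as-is."""
    if "a" <= c <= "z":
        return "a-z"
    if "A" <= c <= "Z":
        return "A-Z"
    if "0" <= c <= "9":
        return "0-9"
    return c


def pair_counts(text):
    """Count adjacent pairs of character classes in text.

    Assumption: a pair is skipped when either character is white space.
    The original program may handle white space differently.
    """
    counts = Counter()
    for first, second in zip(text, text[1:]):
        if first.isspace() or second.isspace():
            continue
        counts[(char_class(first), char_class(second))] += 1
    return counts


if __name__ == "__main__":
    sample = "$total = $count + FOO_BAR;"
    # Print the most frequent pairs in the same layout as the tables.
    for (first, second), n in pair_counts(sample).most_common(5):
        print(f"{n:9d} : {first:>3} {second:>3}")
```

With this representation, `len(pair_counts(text))` gives the number of distinct two-character sequences found in a text.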

Other points of interest:

  • TeX has the largest number of different two-character sequences (nearly 1200), followed closely by Emacs LISP (1150) and Perl (1130). Assembly has the lowest, with fewer than 300.
  • SH has a high frequency of “/a-z” and “a-z/” sequences, possibly related to path manipulation.
  • The highest-ranked non-alphanumeric sequence in C++ is “//”, the comment sequence. Also, the increment operator ‘++’ is seven times less frequent than the decrement operator ‘--’.
  • The frequency of empty string literals is markedly higher in OCaml than in the other languages (perhaps due to typical recursive or folding manipulation).
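
As for the opening question of telling one programming language from another, one plausible approach is to normalise the counts into frequencies and compare an unknown text against a reference signature per language. The sketch below does this with cosine similarity, which is an assumption chosen for simplicity rather than the method of the study or of the statistics program; `signature`, `cosine_similarity` and `guess_language` are hypothetical names, and raw characters are used instead of character classes to keep it short.

```python
from collections import Counter
from math import sqrt


def signature(text):
    """Relative frequency of each adjacent character pair.

    Assumption: pairs that include white space are skipped.
    """
    pairs = Counter(
        (a, b)
        for a, b in zip(text, text[1:])
        if not (a.isspace() or b.isspace())
    )
    total = sum(pairs.values())
    if total == 0:
        return {}
    return {pair: n / total for pair, n in pairs.items()}


def cosine_similarity(sig_a, sig_b):
    """Cosine of the angle between two signatures; 0.0 if either is empty."""
    dot = sum(v * sig_b.get(pair, 0.0) for pair, v in sig_a.items())
    norm_a = sqrt(sum(v * v for v in sig_a.values()))
    norm_b = sqrt(sum(v * v for v in sig_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def guess_language(text, references):
    """Return the name of the reference signature closest to text.

    references maps a language name to a signature built from sample code.
    """
    sig = signature(text)
    return max(references, key=lambda name: cosine_similarity(sig, references[name]))


if __name__ == "__main__":
    # Toy reference texts, far too small to be meaningful.
    references = {
        "PHP": signature("<?php $x = $y + 1; echo $x; ?>"),
        "OCaml": signature("let x = y + 1 in print_int x"),
    }
    print(guess_language("$total = $count;", references))
```

With reference texts this small the answer means little; realistic signatures would have to come from large code bases like the ones counted above.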

Feel free to peruse the above as you wish, or use the statistics program to process your own data sets.
