Automatically determining PageRank, or, unsigned integers in PHP

Market Ruler, LLC develops software for web marketers – and as such, I’m always on the lookout for new technologies to make life easier on the PPC and SEO crowd.

I recently took the SEOMoz toolset for a spin, and in one of their tests, I saw that they automatically checked the Google PageRank of a site. Since I’m the type who likes to see how this is done … ahem, automatically, I dug into their system to see what they were doing.

All I could find was the URL used, which contained a request to Google with some additional parameters. One of these, ch, was set to a really big number.

This is a checksum, and it varies based on which URL you are checking. Google implemented this to prevent automated queries. However, they release their code out into the wild (their toolbar, for example), so some enterprising engineer reverse engineered it, and their “security” is now worthless.

I found, shortly thereafter, this Perl Module to check PageRank which outlines how it’s done.

Since our PPC Tracking Software is written in PHP, I thought I would quickly port the module to PHP. Problem is, it was kind of a nightmare.

As a fair warning, severe geek talk is approaching rapidly.

I should also mention that this type of code is technically against Google’s Terms of Service, which ask that you send no automated queries. Don’t say you haven’t been warned. Rumors of Google banning IP addresses, or user agents that use Web Position Gold, are very much true.

I’m debating how to use this without making it “automated”. Does that mean that a human being has to initiate the action? If that’s all, it may be possible to do without violating their terms.

Anyway, this long digression gets into the specific problem I had. (Here comes the geek talk …) The checksum algorithm works, roughly, as follows:

  • Checksum the actual URL
    • Convert the string into ASCII codes: A = 65, Z = 90, etc.
    • Convert the first 12 characters into 3 unsigned integers, and add each to a “magic” number which accumulates
    • Run the “mix” step which shifts bits around in each integer and combines them with the other integers
    • Repeat with the next 12 characters until you run out of characters
  • You’ll get back a big number.
  • Then take this big number, and subtract multiples of nine from it 20 times (0, 9, 18, etc.)
  • Convert these big numbers into 4 characters each, and combine them into an 80 character string
  • Run the checksum (above) on the new string
  • Prepend the number “6” to the final number (denoting the version number, probably) and you are ready to go!
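
For readers who want to see the shape of those steps, here is a rough sketch. It is written in Python rather than PHP purely for illustration, and it is not the code from the Perl module. The magic constant, the shift amounts inside mix, the little-endian byte order, the zero padding, and the names checksum, second_stage_input, and toolbar_ch are all assumptions made for the example. The Perl module remains the authoritative reference.

```python
import struct

MASK32 = 0xFFFFFFFF  # every intermediate value is wrapped to 32 unsigned bits


def mix(a, b, c):
    """Illustrative mix step: subtract, shift, and XOR the three words.

    The shift amounts are an assumption, not taken from the Perl module.
    """
    for s1, s2, s3 in ((13, 8, 13), (12, 16, 5), (3, 10, 15)):
        a = ((a - b - c) & MASK32) ^ (c >> s1)
        b = ((b - c - a) & MASK32) ^ ((a << s2) & MASK32)
        c = ((c - a - b) & MASK32) ^ (b >> s3)
    return a, b, c


def checksum(data, magic=0x9E3779B9):
    """Fold the input, 12 bytes at a time, into three accumulating words."""
    a = b = magic  # assumed "magic" starting value
    c = 0
    # Simplification: pad with zero bytes so the length is a multiple of 12.
    padded = data + b"\x00" * (-len(data) % 12)
    for offset in range(0, len(padded), 12):
        # 12 characters become 3 unsigned 32-bit integers (byte order assumed).
        x, y, z = struct.unpack_from("<3I", padded, offset)
        a = (a + x) & MASK32
        b = (b + y) & MASK32
        c = (c + z) & MASK32
        a, b, c = mix(a, b, c)
    return c


def second_stage_input(value):
    """Subtract 0, 9, 18, ... from value and pack 20 results into 80 bytes."""
    return b"".join(
        struct.pack("<I", (value - 9 * i) & MASK32) for i in range(20)
    )


def toolbar_ch(url):
    """Checksum the URL, checksum the 80-byte string, then prepend '6'."""
    first = checksum(url.encode("ascii"))
    return "6" + str(checksum(second_stage_input(first)))
```

Calling toolbar_ch("http://example.com/") returns a string of digits starting with 6. Because the constants here are placeholders, the value will not match what Google expects. The point is only to show where the unsigned 32-bit arithmetic happens, and why every addition, subtraction, and left shift needs to wrap at 32 bits.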

The details can be easily found in the Perl module above. And yet, I still haven’t divulged my problem: PHP is terrible at handling unsigned integers properly.

That is, PHP is loosely typed, meaning it does its best to convert values to the most appropriate type depending on what you’re doing. This is usually great, except when it’s not.

In this case, when you do 51234231 << 13, the mathematical result is greater than 2147483647 (the maximum value for a signed 32-bit integer), so it no longer fits in a PHP integer, and arithmetic that overflows like this gets converted to a double.
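
To make the numbers concrete, here is a small illustration in Python, used only because its integers never overflow, so the exact result and the wrapped result can be shown side by side. The variable names are made up for the example.

```python
INT32_MAX = 2147483647   # largest signed 32-bit integer
MASK32 = 0xFFFFFFFF      # keeps only the low 32 bits

raw = 51234231 << 13     # exact result: 419710820352, far past INT32_MAX
wrapped = raw & MASK32   # what an unsigned 32-bit value would hold: 3098992640

# The same 32 bits read as a signed integer come out negative.
signed = wrapped - (1 << 32) if wrapped > INT32_MAX else wrapped  # -1195974656
```

The checksum needs the wrapped value. Anything that hands back the exact value as a double, or the negative signed reading, throws off every step that follows.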

The problem is that when doing various bit-wise manipulations (as this algorithm does), PHP just pukes left and right. Depending on the sign, and the number of bits you’re manipulating, it converts from integer to double seemingly randomly.

The solution for this, if you ever encounter it, is to avoid relying on PHP’s native 32-bit integers at all.

I defined a class called “ulong” with methods bit_and, bit_or, bit_xor, etc., which lets me simulate the properties of an unsigned long without PHP’s automatic conversion problems. The crux is that I break the long into two “short” parts of 16 bits each. Then I map all of the operations correctly onto them.
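
The two-halves idea can be sketched as follows. This is not the downloadable PHP class, just a minimal Python illustration of the same technique. The method names bit_and, bit_or, and bit_xor come from the description above, while from_int, to_int, add, and shift_left are hypothetical names chosen for the example.

```python
class ULong:
    """Unsigned 32-bit value stored as two 16-bit halves (illustrative sketch)."""

    def __init__(self, hi, lo):
        # Each half is masked to 16 bits, so it always stays small.
        self.hi = hi & 0xFFFF
        self.lo = lo & 0xFFFF

    @classmethod
    def from_int(cls, n):
        return cls(n >> 16, n)

    def to_int(self):
        return self.hi * 65536 + self.lo

    def bit_and(self, other):
        return ULong(self.hi & other.hi, self.lo & other.lo)

    def bit_or(self, other):
        return ULong(self.hi | other.hi, self.lo | other.lo)

    def bit_xor(self, other):
        return ULong(self.hi ^ other.hi, self.lo ^ other.lo)

    def add(self, other):
        # Add the low halves first, then carry bit 16 into the high halves.
        lo = self.lo + other.lo
        hi = self.hi + other.hi + (lo >> 16)
        return ULong(hi, lo)

    def shift_left(self, n):
        # Valid for 0 <= n <= 31; bits pushed past bit 31 are discarded.
        if n >= 16:
            return ULong(self.lo << (n - 16), 0)
        return ULong((self.hi << n) | (self.lo >> (16 - n)), self.lo << n)
```

With this sketch, ULong.from_int(51234231).shift_left(13).to_int() gives 3098992640, the properly wrapped unsigned result. Every intermediate value inside the methods stays below 2147483648, which is the point of splitting the number: nothing ever gets large enough to trigger a conversion to double.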

There may be an easier way, but I settled on this after getting more and more frustrated with converting to integer and back to double, using fmod instead of bitwise ands, and having PHP come up with seemingly random results.

Note that this has only been tested in a limited environment, and is not suited for anything high-performance (such as cryptography).

If you find bugs or tweaks, let me know. You can download the source here:

If you find this useful, please comment!