A few days ago I read an interesting entry on Chris Shiflett's blog about Google Web Accelerator (GWA) and how it affects PHP applications. The purpose of GWA is to speed up web page loading and thus improve the user experience. This is done through a series of techniques that involve different caching mechanisms, periodically downloading copies of frequently accessed pages, and prefetching.
Prefetching works on the premise that when you load a web page you will not view just that page, but will also click on a few of its links. So, rather than waiting for you to click those links, the browser prefetches the content of the linked pages in the background while you are reading the current page. By the time you decide to click on the next link, its content is already sitting in the browser's cache and can be loaded instantly. Pretty neat trick, right?
While it is a neat trick, it does present several serious problems that affect both webmasters and the users themselves. Let's start with the webmasters, since after all that's a bit closer to heart.
Most modern pages contain a fair number of navigation links; for some pages, such as forums, the number of links can easily exceed 50. On average a page is about 40-50 KB of HTML, sometimes larger, so while you are viewing page #1, GWA is working in the background to fetch 50 * 40 KB, around 2 megabytes of data. Given that many people have fast connections, such as cable modems and DSL, this only takes a few seconds. Now that you've finished reading the first page, you click through to the next one, which loads instantly since it was prefetched, and start reading it. This page has another 50 links, but let's say only 30 of them are new, so only about 1.2 megabytes of data is prefetched. Now you've had enough and leave the site...
During your visit you've only viewed 2 pages, but your browser, through GWA, downloaded a total of 80 pages, totaling around 3.2 megabytes of data. It has also wasted the system resources needed to generate all those pages. Unless the pages are static, generating their content is not instant, so in addition to bandwidth, a single user is taking up an excessive amount of processing time. This means other visitors to the site may suffer slow loading times, as there is less processing power left to handle their requests.
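The arithmetic behind those figures can be laid out as a short sketch. The numbers are the rough estimates used in the example above, not measurements, and the function and constant names are made up for illustration.

```python
# Rough estimates taken from the example above; none of these are measured values.
PAGE_SIZE_KB = 40               # approximate HTML size of a single page
LINKS_ON_FIRST_PAGE = 50        # links prefetched while page 1 is being read
NEW_LINKS_ON_SECOND_PAGE = 30   # links on page 2 that were not already prefetched
PAGES_ACTUALLY_VIEWED = 2       # pages the visitor really looked at


def transfer_kb(pages, page_size_kb=PAGE_SIZE_KB):
    """Return the amount of data, in KB, needed to fetch `pages` pages."""
    return pages * page_size_kb


prefetched_pages = LINKS_ON_FIRST_PAGE + NEW_LINKS_ON_SECOND_PAGE
prefetched_kb = transfer_kb(prefetched_pages)
viewed_kb = transfer_kb(PAGES_ACTUALLY_VIEWED)

print(f"prefetched: {prefetched_pages} pages, about {prefetched_kb / 1000:.1f} MB")
print(f"viewed:     {PAGES_ACTUALLY_VIEWED} pages, about {viewed_kb} KB")
print(f"overhead:   {prefetched_kb // viewed_kb}x the data the visitor actually read")
```

Under these assumptions the server generates and sends roughly forty times more data than the visitor ever looks at.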
Another problem posed by this technique is that many URLs may actually be action links; in other words, they perform some sort of action when clicked. This could be anything from "subscribe" to "delete", ultimately depending on what the application can do. Guess what: GWA will prefetch those links as well, resulting in the execution of those commands. Google assumed, and we all know what they say about assuming, that people follow the W3C recommendation and do not use GET links for "actions". The truth of the matter is that most sites and applications do exactly that, and many developers are not even aware that such a recommendation exists.
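One way to follow that recommendation is to refuse state-changing actions that arrive over GET, so that a prefetcher following links cannot trigger them. The sketch below is only an illustration: the action names and the `check_action` helper are hypothetical and not taken from any particular application.

```python
# Methods that are expected to be "safe", i.e. to have no side effects.
SAFE_METHODS = {"GET", "HEAD"}

# Hypothetical action names; a real application would list its own.
STATE_CHANGING_ACTIONS = {"subscribe", "delete"}


def check_action(method, action):
    """Return the HTTP status code to use for a request.

    State-changing actions requested with a safe method get
    405 (Method Not Allowed); everything else is let through with 200.
    """
    if action in STATE_CHANGING_ACTIONS and method.upper() in SAFE_METHODS:
        return 405
    return 200


print(check_action("GET", "delete"))    # a prefetched "delete" link is rejected
print(check_action("POST", "delete"))   # a deliberate form submission is accepted
print(check_action("GET", "view"))      # plain page views are unaffected
```

With a check like this, the "delete" link has to become a form that submits via POST, which a link prefetcher will not follow.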
Even in the best of cases this may pose a problem. Consider a bulletin board, for example. Most boards track whether or not you have seen a topic and its messages; this is done by comparing the last time you viewed the topic to the time of the latest message in it. When users click on a link leading to a topic, they effectively mark all messages on the accessed page as read. Well, if Google automatically opens all links, you unwittingly end up marking all topics as read. Oops...
How can this madness be stopped? Fortunately, every prefetch request is accompanied by an X-moz header set to "prefetch", which PHP exposes as HTTP_X_MOZ in $_SERVER. So if this header is present, you can make your application send a "403" response, effectively denying the prefetch of the page.
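As a minimal sketch of that check, here is a small Python WSGI application; WSGI exposes request headers under the same CGI-style HTTP_X_MOZ key. The `is_prefetch` helper and the response bodies are illustrative assumptions, not part of any existing application.

```python
def is_prefetch(environ):
    """Return True if the request carries an X-moz header set to "prefetch"."""
    return environ.get("HTTP_X_MOZ", "").strip().lower() == "prefetch"


def application(environ, start_response):
    """WSGI entry point: deny prefetch requests, serve everything else."""
    if is_prefetch(environ):
        # Refuse to generate the page for a prefetcher.
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"Prefetching is not allowed.\n"]

    # Placeholder for the normal page generation.
    start_response("200 OK", [("Content-Type", "text/html")])
    return [b"<p>Regular page content</p>\n"]


if __name__ == "__main__":
    # Serve on localhost for a quick manual test; the port is arbitrary.
    from wsgiref.simple_server import make_server

    make_server("localhost", 8000, application).serve_forever()
```

The check runs before any page generation, so a denied prefetch costs the server almost nothing.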
While I am very unhappy about Google's auto-prefetch, the actual technique is not that bad when implemented properly. In fact, Gecko-based browsers (Firefox) have had this feature for a while and no one is complaining. The reason is that these browsers only prefetch links that are marked inside the HTML code as safe to prefetch, thus giving the developer control over the operation. More details on the implementation can be found here.