A Self-healing Web
Note: I originally wrote this article before BarCampLondon. After my presentation at that event and the discussion following it, I’ve rewritten the article with the current ideas which seem most appropriate. This is still intended to explain the idea, and I’ll follow up with the best of the suggested implementation ideas.
There are a number of errors that can occur on the Web. A 404 error (file not found) is the most common but other errors or non-errors like a 301 (permanently moved) are also common. You have probably experienced both of these responses and more whilst surfing the web.
Why is this a problem? Well, in the case of a 404 a visitor has tried to access a resource, for example a web page, which is no longer available. This could be for a number of reasons, because a 404 is a generic error. It means the web server can’t find the requested resource. The reason could be a typographical error in the name of the resource or it might have been deleted - all the web server knows is it isn’t there. There are more specific errors for page deletion, such as 410 (gone), but these are seldom used, even by content management systems. In the end, this means that visitors get a pretty bad experience. A real life equivalent would be to ask for directions to the cinema and, when you arrive where you thought the cinema was, to find a sign stating “I’m sorry the cinema isn’t here, I’m not sure why, or where it is, I’m not sure it was ever here. People keep asking me, but I just don’t know.”
The best kind of error you can expect is a 301. This response indicates the resource has been permanently moved to another location. The web server knows what the original address was and what the new address is. The web browser should never ever use the old address again thank you very much. Of course, the visitor doesn’t really see much of that. There is an HTTP request from the browser to the web server, and the web server passes back some HTTP headers with the 301 and the information about the new location. Then the browser makes another request and the visitor gets the page they wanted with a different address in their URL bar, without any extra notifications or unnecessary clicking of links.
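To make that exchange concrete, here is a minimal sketch of the decision the browser makes when the response comes back. The function name, the plain dictionary of headers and the example addresses are assumptions made for illustration; a real browser does a good deal more.

```python
from urllib.parse import urljoin


def follow_permanent_redirect(status, headers, requested_url):
    """Return the address the browser should use from now on."""
    if status == 301 and "Location" in headers:
        # The Location header may be relative, so resolve it
        # against the address that was originally requested.
        return urljoin(requested_url, headers["Location"])
    # Any other response: keep using the address we already had.
    return requested_url


# Illustrative addresses only.
old_address = "http://example.com/films/listing.html"
response_headers = {"Location": "/cinema/now-showing"}
print(follow_permanent_redirect(301, response_headers, old_address))
```

The visitor only ever sees the result: the second address appears in the URL bar and the page loads.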
“Good”, you might say, “the visitor didn’t get an error message and ended up where she wanted to go”. True, but on the other hand the visitor’s browser just made one HTTP request it didn’t need to make. If the visitor requested the old address again, their browser wouldn’t even bother trying it and would go straight to the other address. The 301 it got the first time tells it the resource has moved for good, no need to try again. So, why should a web page send a visitor to the wrong address more than once? If we go back to the example of directions to the cinema, a 301 is like a sign saying “The cinema has moved one block over to the right. We hope to see you there!” It sure is a lot better than finding no help, but wouldn’t it be even better to get up-to-date directions instead of being sent via the old address?
In an ideal world visitors would never get any errors. The resources would never be inaccessible, require credentials or any of the other possible causes of an error. Visitors would never be sent to a resource they couldn’t use. Unfortunately the strength of the web is that no one person owns all of it. This means that there will always be errors as the addresses to access resources shift and change.
The problem with the way we deal with errors right now is that we tell the visitor there was a problem, but we don’t tell the resource that sent them to the error. That’s like the sign telling each person where the cinema moved to. The sign is ok, but wouldn’t it be much better if the people in the cinema discovered who was sending so many people the wrong way and told that person the new location of the cinema? In the real world this would be pretty tricky, and quite a lot of work.
On the web this is a heck of a lot easier. In a normal configuration, the visitor’s browser tells the web server which page referred it. This is how web tracking works. The new suggestion is that when an error occurs, the web server which is giving the error to the visitor also notifies any referrer it gets about that error. This is like telling the person giving bad directions where the cinema is. We don’t know if they will stop giving bad directions, but they might possibly start giving good ones instead.
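As a rough sketch of the server side, the error handler could turn the referrer into a notification record like this. The function name and the field names (source, target, status, new_url) are invented for this example and are not part of any existing standard.

```python
def build_notification(status, requested_url, request_headers, new_url=""):
    """Describe an error so it can be reported to the referring page.

    Returns None when there is nobody to tell, or nothing worth telling.
    """
    # HTTP spells this header "Referer", with a single "r".
    referrer = request_headers.get("Referer")
    if not referrer or status not in (301, 404):
        return None
    return {
        "source": referrer,        # the page that holds the bad link
        "target": requested_url,   # the address the visitor asked for
        "status": status,          # 301 or 404
        # Only a 301 has somewhere better to send people.
        "new_url": new_url if status == 301 else "",
    }
```

If the browser sends no referrer at all, which happens for typed addresses and bookmarks, there is simply nothing to notify.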
The format of this notification could vary. There are a number of commonly used server-to-server communication protocols. XML-RPC and SOAP are two possible options; XML-RPC in particular is widely used in the blogosphere for ‘pingbacks’, which work much like trackbacks.
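If XML-RPC were the chosen format, the notification might be encoded along these lines using Python’s standard xmlrpc.client module. The method name linkError.notify and the order of the parameters are assumptions made up for this sketch; no such method has been standardised.

```python
import xmlrpc.client

# Hypothetical method name, invented for this sketch.
METHOD_NAME = "linkError.notify"


def encode_notification(source, target, status, new_url=""):
    """Build the XML-RPC request body for an error notification."""
    params = (source, target, status, new_url)
    return xmlrpc.client.dumps(params, methodname=METHOD_NAME)


def decode_notification(payload):
    """Parse a request body back into (method name, parameters)."""
    params, method = xmlrpc.client.loads(payload)
    return method, params


body = encode_notification(
    "http://a.example/page",   # the page with the bad link
    "http://b.example/old",    # the address it links to
    301,
    "http://b.example/new",    # where the resource lives now
)
print(body)
```

The receiving side would decode the body and check the method name before acting on it.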
When a web server receives a notification that it is linking to an error, the response should differ from error to error. A notification of a 301 could cause a resource to automatically update to use the new URL, because after all it is just a new reference to the same place. A 404 notification, however, should do something more complex. Since the 404 is such a generic error, there is no new location to use, and you can’t just delete a link. The best course of action is probably to send a notification to the content owner to re-examine the link they used in their content.
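The two responses could be sketched as follows. This assumes the notification is a dictionary with target, status and new_url fields, that the content is plain HTML held in a string, and that owner_inbox is a simple list standing in for whatever really alerts the content owner; all of these are illustrative choices.

```python
def apply_notification(content, notification, owner_inbox):
    """Update a link automatically for a 301, otherwise ask a human."""
    target = notification["target"]
    status = notification["status"]
    new_url = notification.get("new_url", "")

    if status == 301 and new_url:
        # Same resource, new address: safe to rewrite the link.
        return content.replace('href="%s"' % target, 'href="%s"' % new_url)

    # A 404 gives us nothing to replace the link with,
    # so leave the content alone and flag it for review.
    owner_inbox.append(
        "Please re-examine the link to %s (status %d)" % (target, status)
    )
    return content
```

A real content management system would parse the markup rather than rely on string replacement, but the split between an automatic fix and a manual review is the point here.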
All this is simple enough, but there are some more things to consider, most importantly authenticating these error notifications. If a stranger told you that the cinema really isn’t where you thought it was and you should really send people some place else, would you believe them? This stranger could be sent by a rival cinema, or a night club. Unless you know for sure, why should you believe them? This is the real life equivalent to the ever present problem of spam on the Web. It’s pretty simple to get sorted though; all you need to do is ask the stranger for an ID that proves they work for the cinema. If that is not enough (IDs can be faked, too), you could just go visit the new address. As always it is much easier in the computer world: you can use a reverse DNS (rDNS) lookup on the IP address the notification came from, or attempt to access the resource in question.
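Both checks can be sketched in a few lines. The function names are made up for this example, and the lookup and fetch steps are passed in as parameters so the logic can be tried without touching the network. Bear in mind that a reverse lookup often won’t match on shared hosting, so requesting the resource yourself is probably the stronger test.

```python
import socket
from urllib.parse import urlparse


def sender_matches_target(sender_ip, target_url,
                          reverse_lookup=socket.gethostbyaddr):
    """Check the ID: does the sender's IP resolve to the target's host?"""
    target_host = urlparse(target_url).hostname
    if not target_host:
        return False
    try:
        # gethostbyaddr returns (hostname, aliases, addresses).
        sender_host = reverse_lookup(sender_ip)[0].lower()
    except OSError:
        return False  # no reverse record, so no ID to show
    return sender_host == target_host or sender_host.endswith("." + target_host)


def claim_is_confirmed(notification, fetch_status):
    """Visit the address: does the target really respond as claimed?

    fetch_status is a hypothetical callable taking a URL and returning
    (status code, Location header or an empty string).
    """
    status, location = fetch_status(notification["target"])
    if status != notification["status"]:
        return False
    if status == 301:
        return location == notification["new_url"]
    return True
```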
We know that we want to send a notification to a resource telling them there is a problem with the links they are providing. But there is a question of ownership. The ownership of domain.com/me may be entirely independent of domain.com/you. This raises the question of whether notifications should go to some global controller or directly to the resource in question. A global controller at the root of the domain has some advantages: much of the blogging community already uses global level controllers for things like trackback. However, if the ownership of subfolders on a domain is diverse then the global controller would have to know about each owner, or at least how to delegate to a sub-controller. Alternatively, the notification could go directly to the resource. In this case, however, the notification still has to arrive as an HTTP request. This means every resource on a web server needs to be handled by a controller that can accept such requests, for example POST requests. This could be configured as some kind of global mask, but it starts to become more awkward. Realistically, it’s probably more sensible to try both and let the web decide which works best with plain ol’ market forces.
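To show what trying both might mean in practice, this sketch derives the two candidate destinations from a referring address. The controller path /link-errors is purely hypothetical; nothing like it has been agreed anywhere.

```python
from urllib.parse import urlsplit, urlunsplit


def candidate_endpoints(referring_url, controller_path="/link-errors"):
    """Return the global controller first, then the resource itself."""
    parts = urlsplit(referring_url)
    # Keep the scheme and host, swap the path for the controller's path.
    global_controller = urlunsplit(
        (parts.scheme, parts.netloc, controller_path, "", "")
    )
    return [global_controller, referring_url]


print(candidate_endpoints("http://domain.com/me/post.html"))
```

The order here is arbitrary; nothing says the global controller has to be tried first.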
If a response at either the global or the resource level is not received then the other should be tried. If neither works then the erroneous resource should notify its own content owner to manually contact the referring resource. That way you’ll get notified when you are throwing 404s, even if the other guy isn’t fixing it. That isn’t to say you have to be notified for every 404 ever thrown. If a resource is only accessed once every few months it may not be economic to fix. There are a lot of options here about what we can do. I don’t expect the web will ever be perfect. It’s unreasonable to expect every web site owner to fix all human error that has ever occurred on their site, but if they are getting twenty people a minute going to a 404 or even a 301 someone should be doing something about it.
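That fallback order, plus a simple traffic threshold, might look something like this. The send and tell_owner callables are placeholders for whatever transport and alerting a site really uses, and the default of twenty hits a minute is borrowed from the example above rather than being a recommendation.

```python
def deliver(notification, endpoints, send, tell_owner):
    """Try each endpoint in turn; fall back to our own content owner.

    send(endpoint, notification) should return True when the other
    side acknowledged the notification.
    """
    for endpoint in endpoints:
        if send(endpoint, notification):
            return endpoint  # somebody is listening, we're done
    # Nobody answered, so a human will have to get in touch.
    tell_owner(notification)
    return None


def worth_reporting(hits_per_minute, threshold=20):
    """Decide whether an error is busy enough to bother anyone about."""
    return hits_per_minute >= threshold
```

A link that is followed once every few months would stay under the threshold and never generate a notification.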
All this assumes, of course, that errors are being used correctly. If someone is using a 301 to refer visitors to their homepage, then implementing this system could break all the links on your site. It is also feasible that someone might try to use it to stop you from deep linking. Deep linking is creating a link directly to a resource rather than to the homepage of a site. Some content owners have objected to, and even sued over, deep linking to their content - they argue it unfairly damages their revenue stream.
In summary, right now we don’t use half the power of our error pages because we only tell the visitors about the error, and not the people who sent them there in the first place. We have the technology to automatically update our content with fixes and optimizations, and notify content owners when an automatic fix isn’t possible. The only thing we need to do is go out there and do it!




