Innards: XENU linkchecker software: why does it hate Wikipedia?

As the author of the WWW FAQ, I regularly answer questions about the workings of the Web. If a question is frequently asked, I simply add an article to the FAQ. But sometimes a question is more detailed, more in-depth— not really a FAQ, but still of interest to others. You'll find those questions, with my answers, here in Innards along with commentary on other web-technology-related topics.

2007-10-26

Q. I use a program called XENU to check my web site for broken links. All links to Wikipedia are reported as 403 errors. Why?

A. XENU is a nice piece of freeware that "crawls" from link to link on your site looking for links that don't work anymore.

You gave me the following example of a perfectly good link that XENU reports as a 403 "Forbidden" error:

http://en.wikipedia.org/wiki/Antioxidant/

There's nothing wrong with the link. So why doesn't it work for XENU?

The problem is that Wikipedia is specifically refusing to talk to XENU. Wikipedia looks at the browser-identifying "User-Agent:" header that XENU sends when it connects to the Wikipedia server, says "aha, you are not a real person" and refuses the request.

So, unfortunately, you can't use XENU to check Wikipedia links— but this is Wikipedia's policy, not XENU's, and there's nothing the XENU author can really do about it (short of pretending to be another browser in order to trick Wikipedia, which I do not recommend).

The Wikipedia folks probably wish to limit unnecessary, non-human traffic as much as possible because they are providing an extremely popular service for free and need to conserve every drop of bandwidth they can.

How Did You Figure This Out?

Simple: I pretended to be a web browser.

To do its magic, XENU must act as a simple web browser, asking the web server for pages in exactly the same way that a "real" browser like Firefox or Internet Explorer would.

To find out what was going wrong, I connected directly to the Wikipedia web server with a telnet program and talked to it myself, carrying out my very own HTTP session much as a web browser might. Here's a transcript:

telnet en.wikipedia.org 80
Trying 66.230.200.100...
Connected to en.wikipedia.org.
Escape character is '^]'.
GET /wiki/Antioxidant/ HTTP/1.0
Host: en.wikipedia.org

HTTP/1.0 403 Forbidden
Date: Fri, 26 Oct 2007 13:08:52 GMT
Server: Apache
X-Powered-By: PHP/5.2.1
Vary: Accept-Encoding
Content-Length: 35
Content-Type: text/html
X-Cache: MISS from sq28.wikimedia.org
X-Cache-Lookup: MISS from sq28.wikimedia.org:3128
X-Cache: MISS from sq22.wikimedia.org
X-Cache-Lookup: MISS from sq22.wikimedia.org:80
Via: 1.0 sq28.wikimedia.org:3128 (squid/2.6.STABLE13), 1.0 sq22.wikimedia.org:80 (squid/2.6.STABLE13)
Connection: close

Please provide a User-Agent header
Connection closed by foreign host.

Notice that I received a "403 Forbidden" response, much as XENU does.

Why? Because my request was very minimalist— I didn't identify myself with a User-Agent: line. This is the line on which the web browser announces itself to the web server: "hi! I'm Mozilla Firefox for Windows! What's your name?"

Most web sites aren't picky about this— they don't absolutely require a User-Agent: line. But Wikipedia is a pickypedia and doesn't care one bit for rude people who don't introduce themselves. That's why I got this response:

Please provide a User-Agent header

... And a 403 error.

If I try the same "telnet test" again and make sure to include a User-Agent: line, like this:

GET /wiki/Antioxidant/ HTTP/1.0
Host: en.wikipedia.org
User-Agent: TomIsAHumanBeing1.0

Then my request is successful, and I receive the proper page back from the Wikipedia server. A little politeness goes a long way.
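
The telnet test can also be scripted. What follows is only a minimal Python sketch of the same idea, not the method used above. The helper names build_request and send_request are made up for illustration, and a live server may well answer differently today than it did in the 2007 transcript.

```python
import socket


def build_request(host, path, user_agent=None):
    """Build the raw HTTP/1.0 request text that was typed by hand in telnet."""
    lines = [f"GET {path} HTTP/1.0", f"Host: {host}"]
    if user_agent is not None:
        lines.append(f"User-Agent: {user_agent}")
    # Each header line ends with CRLF; an empty line ends the header block.
    return "\r\n".join(lines) + "\r\n\r\n"


def send_request(host, path, user_agent=None, port=80, timeout=10):
    """Send the request over a plain TCP socket and return the status line."""
    request = build_request(host, path, user_agent).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(request)
        response = b""
        # Read only until the first line (the status line) is complete.
        while b"\r\n" not in response:
            chunk = conn.recv(4096)
            if not chunk:
                break
            response += chunk
    return response.split(b"\r\n", 1)[0].decode("latin-1")
```

Calling send_request("en.wikipedia.org", "/wiki/Antioxidant/") repeats the first test, and passing "TomIsAHumanBeing1.0" as the third argument repeats the second. Only the status line is returned, which is enough to tell a 403 from a success.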

But Is XENU Identifying Itself?

Yes, actually. I accessed a test page on my own web site, then looked at my web server's access log file. I found this line (broken into three lines just for readability's sake):

68.32.52.77 - - [26/Oct/2007:08:53:24 -0500]
"GET /boutell/tmp/xenu/ HTTP/1.1" 200 74 www.boutell.com
"-" "Xenu Link Sleuth 1.2j"

XENU is identifying itself as Xenu Link Sleuth 1.2j. That's good behavior. Every program that talks to web servers (web browsers, link checkers and crawlers alike) should identify itself.
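
If you would rather not pick through the log by eye, a short script can pull the User-Agent out of a line like the one above. This is a rough Python sketch that assumes the User-Agent is the last double-quoted field on the line, as it is in my log format, and that no field contains an embedded quote. The function name is made up for illustration.

```python
import re

# The sample access log line from above, rejoined onto a single line.
LOG_LINE = (
    '68.32.52.77 - - [26/Oct/2007:08:53:24 -0500] '
    '"GET /boutell/tmp/xenu/ HTTP/1.1" 200 74 www.boutell.com '
    '"-" "Xenu Link Sleuth 1.2j"'
)


def user_agent_from_log_line(line):
    """Return the last double-quoted field, assumed to be the User-Agent."""
    quoted_fields = re.findall(r'"([^"]*)"', line)
    return quoted_fields[-1] if quoted_fields else None


print(user_agent_from_log_line(LOG_LINE))  # prints: Xenu Link Sleuth 1.2j
```

Other servers may log the fields in a different order, so check your own log format before relying on the "last quoted field" shortcut.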

So, why is XENU getting no love from Wikipedia? Because Wikipedia is specifically recognizing XENU in the request and refusing to play along. XENU gets a 403 error with a different body than I did in my first test, but it gets a 403 error just the same.

I tested this by making my request again with XENU's user agent name:

GET /wiki/Antioxidant/ HTTP/1.0
Host: en.wikipedia.org
User-Agent: Xenu Link Sleuth 1.2j

When I did this, I received a 403 error again, this time with a longer body than the one shown above.

A simple change to the User-Agent (changing "Xenu" to "Bob") resulted in a successful retrieval of the page.

Since this shows that Wikipedia is deliberately blocking XENU, there is nothing that XENU could or should do differently. Folks will just have to accept that Wikipedia forbids the use of XENU to check incoming links to its site, or ask the Wikimedia Foundation to change the policy. Since this involves more traffic for Wikipedia to handle, attaching a nice donation check to that request couldn't hurt.
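
To summarize the tests, here is a tiny Python model of the behavior I observed. It is emphatically not Wikipedia's actual rule set, which I have not seen. The substring match on "xenu", the case-insensitivity and the use of 200 for "success" are all my assumptions, chosen only because they reproduce the results above.

```python
def observed_status(user_agent):
    """Model the status codes seen in the telnet tests (not Wikipedia's real rules)."""
    if not user_agent:
        return 403  # no User-Agent header: "Please provide a User-Agent header"
    if "xenu" in user_agent.lower():
        return 403  # assumption: the block keys on the name "Xenu"
    return 200  # assumption: a successful retrieval means status 200


for ua in (None, "TomIsAHumanBeing1.0", "Xenu Link Sleuth 1.2j", "Bob Link Sleuth 1.2j"):
    print(repr(ua), observed_status(ua))
```

The real server could just as easily match the whole string or use a longer blocklist. The tests above cannot tell those possibilities apart.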



Copyright 1994-2014 Boutell.Com, Inc. All Rights Reserved.