WWW FAQs: How do I mirror another website?


2006-06-28: Downloading a copy of an entire website ("mirroring") is a tricky business, because modern websites are so often built with PHP, ASP, CGI and other dynamic technologies that generate pages on request and often produce URLs that are only used once. That can make the site appear infinitely large to website mirroring software.

Also, mirroring the web pages your browser sees on a site doesn't mean you'll get the "dynamic" behavior. If a website displays the current weather in Cleveland, mirroring the current pages will only get you a frozen "snapshot" of the weather on that particular day.

That said, there are tools available that will help you mirror basic websites that don't have these problems. The best-known of these is GNU wget, a free, open-source tool that can fetch an entire website with a single command. wget is not the friendliest tool in the world, but boy does it work!

Mirroring a website on Windows

If you are running Windows, I recommend Tech Knight's wget for Windows site. Tech Knight offers step-by-step instructions to download and use the wget software on Windows.

Mirroring a website on Linux

You almost certainly have wget already. Try wget --help at the command line. If you get an error message, install wget with your Linux distribution's package manager. Or fetch it from the official wget page and compile your own copy from source.
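
If you'd rather have a script tell you whether wget is present, here is a minimal sketch. The function name have_command is my own invention for this example, not a standard tool; it relies on command -v, which POSIX shells provide for looking up a program on your PATH.

```shell
# have_command is a hypothetical helper name, not a standard tool.
# It succeeds (exit status 0) if the named program can be found on PATH.
have_command() {
    command -v "$1" >/dev/null 2>&1
}

if have_command wget; then
    echo "wget is installed"
else
    echo "wget was not found; install it with your package manager"
fi
```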

Once you have wget installed correctly, the command line to mirror a website is:

wget -m -k -K -E http://url/of/web/site

See man wget or wget --help | more for a detailed explanation of each option.
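
For quick reference, here is the same command spelled out with wget's long option names, wrapped in a small shell function. The name mirror_site is hypothetical and exists only for this example. As far as I know, older wget releases spell the -E option --html-extension rather than --adjust-extension, so check wget --help on your system if the long form is rejected.

```shell
# mirror_site is a hypothetical wrapper name used only for this example.
# Usage: mirror_site http://url/of/web/site
mirror_site() {
    # --mirror            (-m) recursive download with timestamping, no depth limit
    # --convert-links     (-k) rewrite links so the copy can be browsed locally
    # --backup-converted  (-K) keep each original file with a .orig suffix before converting
    # --adjust-extension  (-E) save HTML pages with an .html suffix
    wget --mirror --convert-links --backup-converted --adjust-extension "$1"
}
```

Defining the function downloads nothing; wget only runs when you call mirror_site with a URL.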

If this command seems to run forever, there may be parts of the site that generate an infinite series of different URLs. You can combat this in many ways, the simplest being to use the -l option to specify how many links "away" from the home page wget should travel. For instance, -l 3 will refuse to download pages more than three clicks away from the home page. You'll have to experiment with different values for -l. Consult man wget for additional workarounds.
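
As a sketch of the depth limit in practice, the function below adds -l to the mirror command. The name mirror_shallow and the default depth of 3 are assumptions made for this example. Since -m by itself asks for unlimited depth, I put -l after it, which should let the later option win.

```shell
# mirror_shallow is a hypothetical helper name for this example.
# $1 = URL to mirror, $2 = maximum link depth (defaults to 3 here).
mirror_shallow() {
    # -l is placed after -m so that it overrides the unlimited depth -m implies.
    wget -m -l "${2:-3}" -k -K -E "$1"
}
# Example: mirror_shallow http://url/of/web/site 2
```

Start with a small depth and raise it until you are getting the pages you want.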

Note: some web servers may be set up to "punish" users who download too much, too fast. If you're not careful, using tools like wget could get your IP address banned from the site. You can avoid this problem by using the -w option to specify a delay, in seconds, between page downloads. Usually, this will prevent the web server from viewing your behavior as unacceptable. But your mileage may vary!
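
Here is a minimal sketch of a "polite" mirror using -w. The function name mirror_polite and the two-second default are my own choices for illustration; the right delay depends on the site, so treat the number as a starting point rather than a recommendation.

```shell
# mirror_polite is a hypothetical helper name for this example.
# $1 = URL to mirror, $2 = seconds to wait between downloads (defaults to 2 here).
mirror_polite() {
    # -w makes wget pause for the given number of seconds between retrievals.
    wget -m -k -K -E -w "${2:-2}" "$1"
}
# Example: mirror_polite http://url/of/web/site 5
```

wget also has a --random-wait option that varies the pause around the -w value; see man wget if a fixed delay isn't enough.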

Mirroring a website on MacOS X

Like Linux, MacOS X is a version of Unix. However, wget isn't standard equipment in all versions of MacOS X. If you receive an error message when you try the wget --help command at the MacOS X "Terminal" prompt, you can fetch wget from the TPJ site, which also offers "Simple wget," a user-friendly front end to wget. Most of the site is in Japanese, so some patience is necessary in picking your way through!

Of course, you can also install the developer tools from your MacOS X system CD (if you have not already done so) and then visit the official wget page to build and install wget from source code.

Once you have the command line version of wget for MacOS X installed, just follow my Linux instructions at the MacOS X Terminal prompt.

Offering your mirror to the world

Publicly mirroring someone else's website without their permission is a violation of copyright law. Don't do that.

If you have received their permission, it's easy to offer your mirror to the world. Just use the wget command to download it to a directory inside your own website's space. This is much easier if you have command line access to your own web server, so that you can run wget there directly. But you can also upload the mirrored site to your server by dragging and dropping it into your usual file transfer program after wget is finished.
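
If you do have command line access, downloading straight into your web space might look like the sketch below. The function name mirror_into and the directory path are placeholders I made up for this example. It uses wget's -P (--directory-prefix) option to choose where the files are saved.

```shell
# mirror_into is a hypothetical helper name; the paths below are only examples.
# $1 = directory inside your web space, $2 = URL of the site to mirror.
mirror_into() {
    # Create the target directory if it does not exist yet.
    mkdir -p "$1" || return 1
    # -P (--directory-prefix) tells wget where to save everything it downloads.
    wget -m -k -K -E -P "$1" "$2"
}
# Example: mirror_into /path/to/your/webspace/mirrors http://url/of/web/site
```

With -m, wget normally creates a subdirectory named after the mirrored host inside the directory you give it, so expect the pages to land one level down.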

If you do offer a mirror of another site, make sure you link to the original and explain to users that this is a mirror and not the original. Also be sure to keep your mirror up to date. And once again, get the original site's permission first!

Legal Note: yes, you may use sample HTML, JavaScript, PHP and other code presented above in your own projects. You may not reproduce large portions of the text of the article without our express permission.
