The “Wayback Machine” is one of the more important services in the history of the Internet (and happens to be named after a great gag on the old Rocky & Bullwinkle show). From about 2001 through 2005 it could be counted on to give you a reasonable snapshot of many of the home pages that existed on the Web, as archived by the old Alexa web crawler going back as early as 1996. Likely due to legal complaints or expensive maintenance costs, the regular snapshots petered out around 2005.
I do enjoy going back and revisiting projects I used to work on (and goofy hacks!) like:
Because I wasn’t able to find another service that replicates the functionality of the Wayback Machine, I decided to write my own routines to create my own daily snapshots. Here’s a simple shell script that I run nightly by placing it in the /etc/cron.daily directory on my Debian-based Linux distro:
Here’s a quick rundown of what’s happening:
| Flag | Alias | Description |
| --- | --- | --- |
| -E | --adjust-extension | if the requested file appears to be an HTML document but its name does not end with an HTML extension (for example, “.asp”), append “.html” to the local filename |
| -H | --span-hosts | allow downloads from other domains if necessary |
| -k | --convert-links | convert links to enable local (offline) viewing |
| -K | --backup-converted | save original copies of any edited files using a .orig extension |
| -p | --page-requisites | download any supplemental files (images, stylesheets, scripts) needed to render the document |
| -nd | --no-directories | store all the files in a single directory instead of recreating the directory hierarchy of each host |
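As a rough illustration of how these flags fit together in a nightly job, here is a minimal sketch. It is not the exact script described above: the archive directory, the URL list file, the variable names, and the two helper functions are all assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical nightly snapshot job (a sketch under assumptions).
ARCHIVE_ROOT="${ARCHIVE_ROOT:-/var/archive/snapshots}"  # assumed archive location
URL_LIST="${URL_LIST:-/etc/snapshot-urls.txt}"          # assumed file, one URL per line
WGET="${WGET:-wget}"

# Print the directory for a snapshot of URL $1 taken on date $2,
# laid out as <root>/<hostname>/<date>.
snapshot_dir() {
    host=$(printf '%s\n' "$1" | sed -e 's|^[a-zA-Z]*://||' -e 's|/.*$||')
    printf '%s/%s/%s\n' "$ARCHIVE_ROOT" "$host" "$2"
}

# Fetch one page plus its requisites into today's snapshot directory.
take_snapshot() {
    dest=$(snapshot_dir "$1" "$(date +%Y-%m-%d)")
    mkdir -p "$dest" || return 1
    # -q keeps cron mail quiet; -P sets the directory the files are saved in.
    "$WGET" -E -H -k -K -p -nd -q -P "$dest" "$1"
}

# Only run the loop when the URL list exists and is readable.
if [ -r "$URL_LIST" ]; then
    while IFS= read -r url; do
        if [ -n "$url" ]; then
            take_snapshot "$url" || echo "snapshot failed: $url" >&2
        fi
    done < "$URL_LIST"
fi
```

On Debian, run-parts skips files whose names contain a dot, so a copy placed in /etc/cron.daily would need a name like daily-snapshot rather than daily-snapshot.sh, and it must be executable.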
It’s that easy. Be careful not to hammer the servers you are archiving, and don’t fill up your hard drive through careless use of wget’s recursion flags.
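One way to build that caution into the job is to wrap wget with its throttling options. The sketch below is only an illustration: the function name polite_fetch is hypothetical, and the specific values (2 seconds, 200k, 100m) are arbitrary assumptions to tune for your own situation.

```shell
# Hypothetical wrapper that adds throttling options to every wget call.
#   --wait=2          pause about 2 seconds between retrievals
#   --random-wait     vary that pause so requests are less regular
#   --limit-rate=200k cap download bandwidth
#   --quota=100m      stop starting new downloads after roughly 100 MB
#                     (wget does not apply the quota to a single-file download)
polite_fetch() {
    "${WGET:-wget}" --wait=2 --random-wait --limit-rate=200k --quota=100m "$@"
}

# Example calls (commented out so that loading this file fetches nothing):
# polite_fetch -E -H -k -K -p -nd http://example.com/
# If you do turn on recursion, bound it explicitly, for example:
# polite_fetch -r --level=1 --no-parent http://example.com/docs/
```

Keeping the limits in one wrapper means every URL in the nightly run gets the same treatment, and the recursion depth has to be stated explicitly at each call site.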
This entry was posted by on November 04, 2010 in Linux, Sys Admin and Tutorial.