Build your own "Mini Wayback Machine"

The “Wayback Machine” is one of the more important services in the history of the Internet (and happens to be named after a great gag on the old Rocky & Bullwinkle show). From about 2001 through 2005 you could count on it to give you a reasonable snapshot of many of the home pages that existed on the Web, as archived by the old Alexa web crawler, going back as early as 1996. Likely due to legal complaints or expensive maintenance costs, the regular snapshots petered out around 2005.

I do enjoy going back and revisiting projects I used to work on (and goofy hacks!) like:

Because I wasn’t able to find another service that replicates the functionality of the Wayback Machine, I decided to write my own routines to create daily snapshots. Here’s a simple shell script that I run nightly by placing it in the /etc/cron.daily directory on my Debian-based Linux distro:

Here’s a quick rundown of what’s happening:

| Flag | Alias | Description |
|------|-------|-------------|
| -E | --adjust-extension | If the requested file appears to be an HTML document but does not end with an HTML extension (for example, “.asp”), rename the file using the extension “.html” |
| -H | --span-hosts | Allow downloads from other domains if necessary |
| -k | --convert-links | Convert links to enable local (offline) viewing |
| -K | --backup-converted | Save original copies of any edited files using a .orig extension |
| -p | --page-requisites | Download any supplemental files needed to render the document |
| -nd | --no-directories | Store all the files in a single directory instead of creating a new directory for each unique hostname |
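
The flags above can be combined into a single wget call. What follows is a minimal sketch of a nightly snapshot script along the lines described in this post, not a definitive implementation: the function name `snapshot_site`, the `ARCHIVE_ROOT` location, the dated directory layout and the example URL are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical nightly snapshot script (a sketch, not the author's original).
# ARCHIVE_ROOT is an assumed location; point it wherever you keep snapshots.
ARCHIVE_ROOT="${ARCHIVE_ROOT:-/var/archive/wayback}"

# snapshot_site NAME URL
# Saves one page plus its supporting files into
# $ARCHIVE_ROOT/NAME/YYYY-MM-DD (the layout is an assumption).
snapshot_site() {
    name="$1"
    url="$2"
    dest="$ARCHIVE_ROOT/$name/$(date +%Y-%m-%d)"

    # Create today's directory; give up on this site if that fails.
    mkdir -p "$dest" || return 1

    # -E  add .html to HTML documents that lack an HTML extension
    # -H  allow page requisites to come from other hosts
    # -k  convert links for offline viewing
    # -K  keep a .orig copy of every file that -k rewrites
    # -p  fetch images, CSS and other files needed to render the page
    # -nd put everything in one directory (no per-host directories)
    wget -E -H -k -K -p -nd --directory-prefix="$dest" "$url"
}

# Example call (hypothetical URL); uncomment and adjust:
# snapshot_site "example" "http://www.example.com/"
```

Because each run writes into a directory named after the current date, successive nights do not overwrite each other. When dropping a script like this into /etc/cron.daily on Debian, make it executable and give it a name without a dot (for example `mini-wayback` rather than `mini-wayback.sh`), since run-parts skips file names containing dots.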

It’s that easy. Be careful not to over-ping the servers you are archiving, and not to fill up your hard drive through careless use of wget’s recursion flags.
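
One way to act on that caution is to bound the recursion depth and throttle requests. The sketch below is an illustration under stated assumptions: the function name `polite_snapshot` is hypothetical, and the depth, wait, rate and quota values are arbitrary starting points rather than recommendations. It leaves out -H and -nd, since spanning hosts while recursing can pull in far more than intended, and a flat directory makes file name clashes likely once more than one page is fetched.

```shell
#!/bin/sh
# Hypothetical throttled, depth-limited variant (a sketch).
# All numeric values below are assumptions to tune per site.

# polite_snapshot URL DEST
polite_snapshot() {
    url="$1"    # page to start from
    dest="$2"   # directory to store the files in

    mkdir -p "$dest" || return 1

    # --recursive --level=1  follow links, but only one level deep
    # --no-parent            never climb above the starting directory
    # --wait=2 --random-wait pause roughly two seconds between requests
    # --limit-rate=200k      cap download bandwidth
    # --quota=100m           stop fetching new files after about 100 MB
    wget --recursive --level=1 --no-parent \
         --wait=2 --random-wait \
         --limit-rate=200k \
         --quota=100m \
         -E -k -K -p \
         --directory-prefix="$dest" \
         "$url"
}

# Example call (hypothetical URL and path); uncomment and adjust:
# polite_snapshot "http://www.example.com/" "/var/archive/example"
```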

This entry was posted by Scott Fitchet on November 04, 2010 in Linux, Sys Admin and Tutorial.
