I recently needed to download an entire web page to my local computer. I had several requirements:
- "look and feel" of webpage must stay exactly the same,
- all internal and external links must stay valid,
- all JavaScript must work.
This task turned out to be even simpler than I expected. I decided to use GNU wget, a command-line tool included in most Linux distributions; a Windows version is also available. Basic usage is quite straightforward:
$ wget https://example.com/
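As a side note, wget's -O switch (short for --output-document) lets you pick the output filename yourself:
$ wget -O mypage.html https://example.com/
(mypage.html is just a name I made up for this example.)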
Run without any extra switches, wget saves the page as index.html in the current folder. To achieve the result I described at the beginning, we have to do some more magic:
$ wget \
--recursive \
--no-clobber \
--page-requisites \
--html-extension \
--convert-links \
--restrict-file-names=windows \
--domains example.com \
--no-parent \
https://www.example.com/folder1/folder/
This will do exactly what we wanted. Let's see what each of these switches does:
- --recursive - recursively download all files linked from the main file,
- --no-clobber - do not overwrite files that already exist locally (useful when a previous run failed for any reason),
- --page-requisites - download all page elements (JS, CSS, images, ...),
- --html-extension - add a .html extension to files (if not already there),
- --convert-links - rewrite links in HTML files so they work offline,
- --restrict-file-names=windows - rename files so they are also valid on Windows,
- --domains example.com - limit downloads to the listed domains (links pointing to other domains will not be followed),
- --no-parent - do not ascend to the parent directory, i.e. never download files from folders above the given one (folder1/folder/ in our example; files directly under /folder1 are not going to be transferred).
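For the lazy, the same command can be written with wget's short switches; as far as I remember the mappings are -r for --recursive, -nc for --no-clobber, -p for --page-requisites, -E for --html-extension, -k for --convert-links, -D for --domains and -np for --no-parent (--restrict-file-names has no short form):
$ wget -r -nc -p -E -k -np \
    --restrict-file-names=windows \
    -D example.com \
    https://www.example.com/folder1/folder/
When it finishes, wget recreates the site's structure under a folder named after the host (www.example.com here), so opening the downloaded index.html (under www.example.com/folder1/folder/ in this example) in a browser is a quick way to confirm that the links and scripts really do work offline.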