Download an entire web page using wget

I recently needed to download an entire web page to my local computer. I had several requirements:

  • "look and feel" of webpage must stay exactly the same,
  • all internal and external links must stay valid,
  • all JavaScript must keep working.

This task turned out to be even simpler than I expected. I decided to use GNU wget. This command line tool is included in most Linux distributions, and a Windows version is also available. Basic usage is quite straightforward:

wget https://example.com/

This will download the file index.html to the current folder. To achieve the result I described at the beginning, we have to do some more magic:

$ wget \
     --recursive \
     --no-clobber \
     --page-requisites \
     --html-extension \
     --convert-links \
     --restrict-file-names=windows \
     --domains example.com \
     --no-parent \
         https://www.example.com/folder1/folder/

This will do exactly what we wanted. Let's see what each of these switches does:

  • --recursive - recursively download all files that are linked from the main file,
  • --no-clobber - do not overwrite files that already exist locally (useful when a previous run failed for any reason),
  • --page-requisites - download all page elements (JS, CSS, images, ...),
  • --html-extension - add the .html extension to files that do not already have it,
  • --convert-links - rewrite links in the downloaded HTML files so they work offline,
  • --restrict-file-names=windows - rename files so that the names are also valid on Windows,
  • --domains example.com - limit downloads to the listed domains (links that point to other domains will not be followed),
  • --no-parent - never ascend to the parent directory; only files under the given folder (folder1/folder/ in our example) are downloaded, so files sitting directly in /folder1 are not transferred.
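For reference, the same download can be written with wget's short option names (--restrict-file-names has no short form). The --wait and --random-wait switches are my own addition here, not part of the recipe above: they simply pause between requests so the target server is not hammered. A minimal sketch, assuming the same example URL:

$ wget -r -nc -p -E -k \
     --restrict-file-names=windows \
     -D example.com -np \
     --wait=1 --random-wait \
         https://www.example.com/folder1/folder/

Once the run finishes, opening the local copy of index.html under www.example.com/folder1/folder/ in a browser should show the page exactly as it looks online, with links and page elements resolved from the local files.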