I recently needed to download an entire web page to my local computer. I had several requirements:
- "look and feel" of webpage must stay exactly the same,
- all internal and external links must stay valid,
- all JavaScript must work.
This task turned out to be even simpler than I expected. I decided to use GNU wget, a command-line tool included in most Linux distributions; a Windows version is also available. Basic usage is quite straightforward:
$ wget https://example.com/
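As a side note, wget's -O switch (short for --output-document) lets you pick the output filename yourself:
$ wget -O mypage.html https://example.com/
(mypage.html is just a name I made up for this example.)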
Run without any extra switches, wget saves the page as index.html in the current folder. To achieve the result I described at the beginning, we have to do some more magic:
$ wget \
--recursive \
--no-clobber \
--page-requisites \
--html-extension \
--convert-links \
--restrict-file-names=windows \
--domains example.com \
--no-parent \
https://www.example.com/folder1/folder/
This will do exactly what we wanted. Let's see what each of these switches does:
- --recursive - recursively download all files linked from the main file,
- --no-clobber - do not overwrite files that already exist locally (useful when a previous run failed for any reason),
- --page-requisites - download all page elements (JS, CSS, images, ...),
- --html-extension - add a .html extension to files (if not already there),
- --convert-links - rewrite links in HTML files so they work offline,
- --restrict-file-names=windows - rename files so they are also valid on Windows,
- --domains example.com - limit downloads to the listed domains (links pointing to other domains will not be followed),
- --no-parent - do not ascend to the parent directory, i.e. never download files from folders above the given one (folder1/folder/ in our example; files directly under /folder1 are not going to be transferred).
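For the lazy, the same command can be written with wget's short switches; as far as I remember the mappings are -r for --recursive, -nc for --no-clobber, -p for --page-requisites, -E for --html-extension, -k for --convert-links, -D for --domains and -np for --no-parent (--restrict-file-names has no short form):
$ wget -r -nc -p -E -k -np \
    --restrict-file-names=windows \
    -D example.com \
    https://www.example.com/folder1/folder/
When it finishes, wget recreates the site's structure under a folder named after the host (www.example.com here), so opening the downloaded index.html (under www.example.com/folder1/folder/ in this example) in a browser is a quick way to confirm that the links and scripts really do work offline.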