I recently needed to download an entire web page to my local computer. I had several requirements:
- "look and feel" of webpage must stay exactly the same,
- all internal and external links must stay valid,
- all JavaScript must still work.
This task turned out to be even simpler than I expected. I decided to use GNU wget. This command-line tool is included in most Linux distributions, and a Windows version is also available. Basic usage is quite straightforward:
wget https://example.com/
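If you want the page saved under a different name, or into a specific folder, the -O and -P switches handle that (the file and folder names here are just placeholders of mine):

$ wget -O landing.html https://example.com/
$ wget -P downloads/ https://example.com/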
With no extra switches, wget will simply save the page as index.html in the current folder. To achieve the result I described at the beginning, we have to do some more magic:
$ wget \
     --recursive \
     --no-clobber \
     --page-requisites \
     --html-extension \
     --convert-links \
     --restrict-file-names=windows \
     --domains example.com \
     --no-parent \
     https://www.example.com/folder1/folder/
This will do exactly what we wanted. Let's see what each of these switches does:
- --recursive - recursively download all files that are linked from the main file,
- --no-clobber - do not overwrite files that already exist locally (useful when a previous run failed for any reason),
- --page-requisites - download all elements the page needs (JS, CSS, images, ...),
- --html-extension - add the .html extension to downloaded files (if it is not already there),
- --convert-links - rewrite links in the HTML files so they work offline,
- --restrict-file-names=windows - rename files so the names are also valid on Windows,
- --domains example.com - limit downloads to the listed domains (links that point to other domains will not be followed),
- --no-parent - do not ascend into parent folders, i.e. do not download files from folders above the given one (folder1/folder/ in our example; files that live directly in /folder1 are not going to be transferred).
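One addition worth considering when the site is not your own: wget can throttle itself so the crawl does not hammer the server. A gentler variant might look like this; the one-second wait and the 200k rate cap are just values I would start with, not anything the options above require:

$ wget --wait=1 --random-wait --limit-rate=200k \
     --recursive --no-clobber --page-requisites --html-extension \
     --convert-links --restrict-file-names=windows \
     --domains example.com --no-parent \
     https://www.example.com/folder1/folder/

Also note that --page-requisites will not fetch assets hosted on other domains unless you add --span-hosts (and list those extra domains in --domains), so a page that pulls its CSS or JS from a CDN may need that tweak to look identical offline.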