Tuesday, August 26, 2008

recursive download with wget

I often need to download documentation for offline use. Online docs typically come as a finely branched tree of separate .html files; sometimes a one-page or PDF version is also available, but those aren't as convenient to use.

I'd gotten sick of using "File/Save As..." from my browser again and again, but Firefox extensions like DownThemAll flattened the directory structure, broke the internal links, and didn't crawl the pages to pick up sub-pages.

The answer, of course, was in my /usr/bin all along: wget.
catherine@Elli:~/docs$ wget --recursive --convert-links \
> --page-requisites --no-parent \
> http://www.postgresql.org/docs/8.3/interactive/index.html

or, shortcut form,

catherine@Elli:~/docs$ wget -rkp -np \
> http://www.postgresql.org/docs/8.3/interactive/index.html

pulls down everything beneath the index page, and it turns out perfectly, with a single command!

--convert-links rewrites all the internal hyperlinks so they point at your downloaded copies and keep working offline.
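
To see the effect, you can grep the hrefs in the downloaded index page (the path here just assumes wget's default behavior of mirroring the site's directory structure under the current directory); after conversion they should mostly be relative local paths rather than absolute postgresql.org URLs:
catherine@Elli:~/docs$ grep -o 'href="[^"]*"' \
> www.postgresql.org/docs/8.3/interactive/index.html | head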

Without --no-parent, wget would find the link to "Home" and download everything under that.

--page-requisites makes sure you get the images, stylesheets, etc. that your pages need to display properly.
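
A couple of other options can help with bigger recursive grabs; the values below are just reasonable-looking guesses, not anything these docs require. --wait pauses between requests so you're gentler on the server, and --level caps the recursion depth:
catherine@Elli:~/docs$ wget -rkp -np --wait=1 --level=5 \
> http://www.postgresql.org/docs/8.3/interactive/index.html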

Now bookmark the downloaded top page in Firefox, give the bookmark a keyword, and voilà! Just typing that keyword into the Firefox URL bar takes you straight to your perfect offline mirror.
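
Assuming the default directory layout again, the downloaded top page to bookmark sits at something like:
catherine@Elli:~/docs$ firefox www.postgresql.org/docs/8.3/interactive/index.html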

3 comments:

stax77@t-online.de said...

Thanks for your quick help.
(Sometimes you simply need the syntax.)
I used GNU Wget on Windows 2k8.

Best regards from Germany

Anonymous said...

Thanks for your quick help.
Sometimes you simply need a syntax how-to.

I used GNU Wget on Windows 2k8.

Best Regards from Germany

Thorsten

stax77(ät)t-online.de

Unknown said...

But what about external links? I.e., if I wget Wikipedia, would this download every single website that is referenced on the Wikipedia pages?

This could end up downloading the whole internet :-)