Lessons in Web Spidering: Spidering 1011: How to fetch an entire Web-site with wget

"GNU Wget is a free utility for non-interactive download of files from the Web. It supports HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies."

Basic wget usage is to fetch a single URL:

wget http://example.org/

But we want to spider an entire site recursively! The manual page documents every option and is the best reference:

man wget

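You will want to use the --recursive option, which turns on recursive retrieval, and the --level option, which sets the maximum recursion depth (the default is 5).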
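
A minimal sketch of those two options working together. The function name spider_site, the depth of 3, and the WGET override for dry runs are illustrative assumptions, not anything built into wget:

```shell
# Recursively fetch a site, following links at most 3 levels deep.
# spider_site and the depth of 3 are illustrative choices.
# Set WGET=echo to print the arguments instead of downloading.
spider_site() {
    "${WGET:-wget}" --recursive --level=3 "$1"
}
```

Call it as: spider_site http://example.org/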

Some HTTP daemons block strange user agents. You can masquerade as an ordinary browser with --user-agent.
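
Something like the following should do it. The function name fetch_as_browser and the user-agent string are made up for illustration; substitute whatever string a browser you trust actually sends:

```shell
# Fetch a URL while presenting a browser-like User-Agent header.
# The string below is a shortened, made-up example.
# Set WGET=echo to print the arguments instead of downloading.
fetch_as_browser() {
    "${WGET:-wget}" --user-agent="Mozilla/5.0 (X11; Linux x86_64)" "$1"
}
```

Call it as: fetch_as_browser http://example.org/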

Some Web scripts block requests that do not carry a Referer header (you must follow a link to the URL and not access it directly). You can pretend that you were referred from a page with --referer (note the single-r spelling, which matches the HTTP header name).
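
As a rough sketch: wget spells the option --referer, matching the HTTP header. The function name fetch_with_referer and the example URLs are assumptions for illustration:

```shell
# Fetch a URL while claiming to have followed a link from another page.
#   $1 = page we pretend to come from
#   $2 = URL to fetch
# Set WGET=echo to print the arguments instead of downloading.
fetch_with_referer() {
    "${WGET:-wget}" --referer="$1" "$2"
}
```

Call it as: fetch_with_referer http://example.org/ http://example.org/page.html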

Other useful options for recursive spidering:

--accept/--reject: Specify comma-separated lists of file name suffixes or patterns to accept or reject.
--domains/--exclude-domains: Specify comma-separated lists of domains to be followed or not to be followed.
--follow-tags/--ignore-tags: Wget has an internal table of HTML tag / attribute pairs that it considers when looking for linked documents during a recursive retrieval. To have only a subset of those tags considered, specify them in a comma-separated list with --follow-tags; to skip certain tags instead, list them with --ignore-tags.
--span-hosts: Enable spanning across hosts when doing recursive retrieving.
--no-parent: Do not ever ascend to the parent directory when retrieving recursively.
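
Several of the options above can be combined in one command. This is only a sketch: the function name spider_docs, the depth of 2, the example.org domain, and the html,pdf suffix list are all illustrative assumptions:

```shell
# Recursively fetch only HTML and PDF files from one host,
# without climbing above the starting directory.
# Set WGET=echo to print the arguments instead of downloading.
spider_docs() {
    "${WGET:-wget}" \
        --recursive --level=2 \
        --no-parent \
        --domains=example.org \
        --accept=html,pdf \
        "$1"
}
```

Call it as: spider_docs http://example.org/docs/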

Of course, there are many more useful options, such as --include-directories!

Lynx (a text-based Web browser) is also a useful tool.
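
One way Lynx can help, as a sketch: its -dump and -listonly options together print the links found on a page, which makes it easy to see what a spider would follow. The function name list_links and the LYNX override are illustrative assumptions:

```shell
# Print the list of links found on a single page.
# Set LYNX=echo to print the arguments instead of fetching.
list_links() {
    "${LYNX:-lynx}" -dump -listonly "$1"
}
```

Call it as: list_links http://example.org/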

Wednesday, 24 June 2009