Wednesday, 24 June 2009

Spidering 1011: How to fetch an entire Web-site with wget

"GNU Wget is a free utility for non-interactive download of files from the Web. It supports HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies."

Basic wget usage is to fetch a single URL:
wget http://example.org/
But we want to spider an entire site recursively! The manual is the best reference for the options involved:
man wget
You will want to use the --recursive and --level options.
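
As a rough sketch of how those two options combine (the helper name spider_site and the depth of 3 are assumptions chosen for illustration, not anything wget requires):

```shell
# Hypothetical helper: recursively fetch a site, at most three links deep.
# Set DRY_RUN=1 to print the wget command instead of running it.
spider_site() {
    # --recursive follows links found in the pages it fetches;
    # --level=3 stops three hops from the start URL (wget's default is 5).
    ${DRY_RUN:+echo} wget --recursive --level=3 "$1"
}

# Usage: spider_site http://example.org/
```

Printing the command first with DRY_RUN=1 is a cheap way to check it before letting wget loose on a real site.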

Some HTTP daemons block strange user agents. You can masquerade as an ordinary browser with --user-agent.

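Some Web scripts block requests that do not have a referrer (that is, you must follow a link to the URL rather than access it directly). You can pretend that you were referred from a page with --referer (note the spelling: wget follows the HTTP header name, which has a single "r" in the middle).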
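
Both headers can be set in one call. The following is only a sketch: the helper name fetch_as_browser, the user-agent string, and the referring page are all placeholder assumptions to be replaced with your own values.

```shell
# Hypothetical helper: fetch a URL while sending browser-like headers.
# Set DRY_RUN=1 to print the wget command instead of running it.
fetch_as_browser() {
    # The option is spelled --referer, matching the HTTP header name.
    # Both header values below are placeholders for illustration.
    ${DRY_RUN:+echo} wget \
        --user-agent="Mozilla/5.0 (X11; Linux x86_64)" \
        --referer="http://example.org/" \
        "$1"
}

# Usage: fetch_as_browser http://example.org/page.html
```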

Other useful options for recursive spidering:
  • --accept/--reject: Specify comma-separated lists of file name suffixes or patterns to accept or reject.
  • --domains/--exclude-domains: Set the domains to be followed, or the domains that must not be followed.
  • --follow-tags/--ignore-tags: Wget has an internal table of HTML tag / attribute pairs that it considers when looking for linked documents during a recursive retrieval. If you want only a subset of those tags to be considered, specify them in a comma-separated list with --follow-tags; --ignore-tags does the reverse and skips the tags you list.
  • --span-hosts: Enable spanning across hosts when doing recursive retrieving.
  • --no-parent: Do not ever ascend to the parent directory when retrieving recursively.
Of course, there are many more options that will be useful, such as --include-directories, and many more!
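
Several of the options above can be stacked in a single command. A minimal sketch, assuming you want only image files from one site; the helper name spider_images, the domain, the suffix list, and the depth are all illustrative choices:

```shell
# Hypothetical helper: crawl one site and keep only image files.
# Set DRY_RUN=1 to print the wget command instead of running it.
spider_images() {
    # --no-parent keeps the crawl at or below the starting directory;
    # --domains limits which domains are followed;
    # --accept keeps only files whose names end in the listed suffixes.
    ${DRY_RUN:+echo} wget --recursive --level=2 \
        --no-parent \
        --domains=example.org \
        --accept=jpg,png,gif \
        "$1"
}

# Usage: spider_images http://example.org/gallery/
```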

Lynx (a text-based Web browser) is also a useful tool.
