"GNU Wget is a free utility for non-interactive download of files from the Web. It supports HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies."
Basic wget usage is to fetch a single URL:
wget http://example.org/
But we want to spider an entire site recursively! The manual is the most useful reference if you are comfortable reading man pages:
man wget
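You will want to use the --recursive option, which makes wget follow links, and the --level option, which limits how many links deep it goes (the default is 5).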
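As a rough sketch of how these two options combine, the wrapper below limits a recursive crawl to a given depth. The function name spider_site and the WGET override variable are made up for illustration, and the depth of 2 in the example call is arbitrary.

```shell
#!/bin/sh
# Illustrative sketch only: spider_site and the WGET variable are hypothetical names.
# Set WGET=echo to preview the arguments without touching the network.
WGET="${WGET:-wget}"

spider_site() {
    url=$1
    depth=${2:-5}    # fall back to wget's default maximum depth of 5
    # --recursive follows links; --level caps how many links deep to go.
    "$WGET" --recursive --level="$depth" "$url"
}

# Example call, left commented out so that sourcing this file downloads nothing:
# spider_site http://example.org/ 2
```

Keeping the depth small on a first run is a cheap way to see how much a site would pull in before committing to a full crawl.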
Some HTTP servers block unfamiliar user agents. You can masquerade as an ordinary browser with --user-agent.
Some Web scripts block requests that do not have a referrer (that is, you must follow a link to the URL rather than access it directly). You can pretend that you were referred from a page with --referer (wget spells it with a single "r", like the HTTP Referer header).
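The two header tricks can be combined in one call. The following is only a sketch: fetch_as_browser, UA, and REFERER_URL are hypothetical names, the user-agent string is an invented example rather than a real browser's, and the URLs are placeholders.

```shell
#!/bin/sh
# Illustrative sketch only: fetch_as_browser, UA, and REFERER_URL are hypothetical names,
# and the user-agent string is a made-up example rather than a real browser's.
# Set WGET=echo to preview the arguments without touching the network.
WGET="${WGET:-wget}"
UA="Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0"
REFERER_URL="http://example.org/index.html"

fetch_as_browser() {
    # Quote the values: the user-agent string contains spaces.
    "$WGET" --user-agent="$UA" --referer="$REFERER_URL" "$1"
}

# Example call, left commented out so that sourcing this file downloads nothing:
# fetch_as_browser http://example.org/page.html
```

In practice you would copy the user-agent string from the browser you want to imitate and set the referer to a page that really links to the target.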
Other useful options for recursive spidering:
- --accept/--reject: Specify comma-separated lists of file name suffixes or patterns to accept or reject.
- --domains/--exclude-domains: Set the domains to be followed, or the domains that are not to be followed. Both take a comma-separated list.
- --follow-tags/--ignore-tags: Wget has an internal table of HTML tag / attribute pairs that it considers when looking for linked documents during a recursive retrieval. To have only a subset of those tags considered, specify them in a comma-separated list with --follow-tags; to skip certain tags instead, list them with --ignore-tags.
- --span-hosts: Enable spanning across hosts when doing recursive retrieving.
- --no-parent: Do not ever ascend to the parent directory when retrieving recursively.
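Several of the options above are typically used together. The sketch below shows one plausible combination; crawl_docs is a hypothetical function name, and the depth, domain, and suffix list are arbitrary example values.

```shell
#!/bin/sh
# Illustrative sketch only: crawl_docs, the depth, the domain, and the suffix list
# are hypothetical example values.
# Set WGET=echo to preview the arguments without touching the network.
WGET="${WGET:-wget}"

crawl_docs() {
    # --no-parent   stay at or below the starting directory
    # --span-hosts  allow links to other hosts to be followed ...
    # --domains     ... but only hosts within this domain
    # --accept      keep only files with these suffixes
    "$WGET" --recursive --level=3 \
        --no-parent \
        --span-hosts --domains=example.org \
        --accept=html,css,png \
        "$1"
}

# Example call, left commented out so that sourcing this file downloads nothing:
# crawl_docs http://example.org/docs/
```

Pairing --span-hosts with --domains is what lets the crawl reach sibling hosts such as a separate image server without wandering off to unrelated sites.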
Of course, there are many more useful options, such as --include-directories.
Lynx (a text-based Web browser) is also a useful tool.