Wednesday, 24 June 2009
Spidering 101: Meet your toolbox
Useful tools for spidering:
- A decent operating system.
- A database for storing your data and organising it into processing queues.
- HTML Tidy for parsing malformed HTML into well-formed XHTML.
- A scripting language that supports regular expressions for doing dirty work.
- An XML parser that supports DOM and XPath.
- A decent WWW automation client for fetching pages and following links.
- A brain.
- A purpose (of good intent).
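
As a rough illustration of how a few of the tools listed in this post fit together, here is a minimal sketch in Python using only the standard library. It is not taken from the post: the table layout, the function names (`make_queue`, `enqueue`, `next_url`, `mark_done`, `extract_links`, `is_http`) and the sample page are all assumptions made for the example. SQLite stands in for the database queue, `xml.etree.ElementTree` (which supports only a limited XPath subset) stands in for the XML parser, and `re` does the dirty work. Fetching pages and running HTML Tidy are left out, so the input is assumed to be well-formed XHTML already.

```python
import re
import sqlite3
import xml.etree.ElementTree as ET

# Namespace used by well-formed XHTML documents (e.g. HTML Tidy output).
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}


def make_queue():
    """Create an in-memory database holding a simple URL processing queue."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE queue ("
        "url TEXT PRIMARY KEY, "
        "done INTEGER NOT NULL DEFAULT 0)"
    )
    return db


def enqueue(db, url):
    """Add a URL to the queue; URLs already present are silently ignored."""
    db.execute("INSERT OR IGNORE INTO queue (url) VALUES (?)", (url,))


def next_url(db):
    """Return the oldest unprocessed URL, or None if the queue is empty."""
    row = db.execute(
        "SELECT url FROM queue WHERE done = 0 ORDER BY rowid LIMIT 1"
    ).fetchone()
    return row[0] if row else None


def mark_done(db, url):
    """Flag a URL as processed so it is not handed out again."""
    db.execute("UPDATE queue SET done = 1 WHERE url = ?", (url,))


def extract_links(xhtml):
    """Return the href of every <a> element that has one."""
    root = ET.fromstring(xhtml)
    return [a.get("href") for a in root.findall(".//x:a[@href]", XHTML_NS)]


def is_http(url):
    """Regular-expression check that keeps only http(s) links."""
    return re.match(r"https?://", url) is not None


# Hypothetical sample page, standing in for a fetched and tidied document.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<a href="http://example.com/a">A</a>
<a href="mailto:someone@example.com">mail</a>
<a href="https://example.com/b">B</a>
<a name="anchor">no href</a>
</body></html>"""

db = make_queue()
for link in extract_links(SAMPLE):
    if is_http(link):
        enqueue(db, link)
```

A real spider would loop on `next_url`, fetch each page with a WWW automation client, tidy it, extract its links, and call `mark_done`. Keeping the queue in a database rather than in memory means a crashed run can pick up where it stopped.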