Lessons in Web Spidering

Saturday, 27 June 2009

(Off Topic) My Shiny Tiny WRAP Firewall Running Voyage Linux

I purchased a WRAP (Wireless Router Application Platform) board from Yawarra. These boards can be used as firewalls or wireless routers. The advantage of Yawarra is that they give you a nice chassis to work with.

Other bits I needed:

A 1GB CompactFlash card for the Operating System.
An all-in-one card reader/writer.
A null modem serial cable, for connecting to the WRAP.
A USB to RS232 Converter, because most desktop motherboards do not have serial ports these days.

Next step: installing Voyage Linux, which is based on Debian. The README is very easy to follow.

Victory! Below is a screenshot using CuteCom to connect to the router via a serial cable. Notice that I am using /dev/ttyUSB0 because I am using a USB to RS232 Converter.

A picture of the WRAP:

Planned usage:

Install OpenSIPS (rather than Asterisk) and use it as a VoIP router.

Thursday, 25 June 2009

How to compile ELinks from source on Debian or Ubuntu

ELinks is a rather handy text-based Web browser.

First install GnuTLS development files to allow SSL support:

aptitude install libgnutls-dev

Then, compile and install ELinks:

wget http://elinks.or.cz/download/elinks-0.12pre4.tar.bz2
tar -xjvf elinks-0.12pre4.tar.bz2
cd elinks-0.12pre4
./configure --with-gnutls
make
make install

Wednesday, 24 June 2009

Spidering 104: How to crawl a Web-site into a MySQL database (coming soon)

You were introduced to wget in lesson 1011. Now we are going to put the fetched documents into a database with meta information such as the URL, retrieval date, outgoing links, etc. This will provide a simple point of integration with other parts of your application suite and an indexed table in which to look up your data. An index lets the database find matching rows without scanning the whole table, and frequently used index pages are usually cached in memory for quick retrieval.
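Until the full lesson is written, here is a minimal sketch of the idea. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server; the table names, column names and the store_page helper are all hypothetical, and the SQL would need small changes for MySQL.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: one row per fetched page, one row per outgoing link.
SCHEMA = """
CREATE TABLE IF NOT EXISTS page (
    id INTEGER PRIMARY KEY,
    url TEXT NOT NULL UNIQUE,      -- UNIQUE also gives us an index for lookups by URL
    retrieved_at TEXT NOT NULL,    -- ISO 8601 timestamp in UTC
    body BLOB NOT NULL
);
CREATE TABLE IF NOT EXISTS link (
    page_id INTEGER NOT NULL REFERENCES page(id),
    target_url TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS link_target ON link(target_url);
"""

def store_page(conn, url, body, outgoing_links):
    """Insert one fetched document and its outgoing links; return the page id."""
    retrieved_at = datetime.now(timezone.utc).isoformat()
    cur = conn.execute(
        "INSERT INTO page (url, retrieved_at, body) VALUES (?, ?, ?)",
        (url, retrieved_at, body),
    )
    page_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO link (page_id, target_url) VALUES (?, ?)",
        [(page_id, target) for target in outgoing_links],
    )
    conn.commit()
    return page_id

# In-memory database, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
store_page(conn, "http://example.org/", b"<html>...</html>", ["http://example.org/about"])
```

Keeping the links in their own table means you can ask "which pages link to this URL?" with a single indexed query.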

Coming soon...

Spidering 103: How to analyse HTTP traffic

Analysing HTTP traffic is useful for discovering the personality of Web-sites.

You can analyse HTTP traffic with the Firebug extension for Mozilla Firefox. If you need to pretend to be using Internet Explorer for some reason, you can use a User Agent Switcher.
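The same trick works from a script. The sketch below builds a request with a chosen User-Agent header using Python's standard urllib; the build_request helper and the Internet Explorer style string are assumptions for illustration, and nothing is sent over the network.

```python
import urllib.request

# Hypothetical User-Agent string in the style of Internet Explorer;
# substitute whichever one you actually need.
IE_USER_AGENT = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"

def build_request(url, user_agent=IE_USER_AGENT):
    """Build (but do not send) a request carrying the chosen User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

request = build_request("http://example.org/")
# urllib stores header names capitalised as "User-agent".
print(request.get_header("User-agent"))
```

To actually send it, pass the request object to urllib.request.urlopen.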

Wireshark and tcpdump are also useful, but they do not integrate with your Web browser, and analysing HTTPS with them is awkward because the traffic is encrypted on the wire.

(Off Topic) How to measure the performance of your application development team

If you ever end up working for a company that implements Key Performance Indicators, the objectives, actions and KPIs (measures) below may be useful for defining your role as an Analyst Programmer:

Objective 1: Design Simple Solutions That Meet Requirements

Actions:

Keep the design simple
Use known design patterns where appropriate
Split design into comprehensible components
Focus on deliverables during design phase
Maintain a high degree of discussion between developers
Maintain communication with business owners
Maintain communication with domain experts
Seek feedback from business owners early
Use mock-ups where appropriate to convey concepts clearly

Measures:

Perceived visibility to design process
Discuss effectiveness of design in post-release meeting

Objective 2: Efficiently Implement Maintainable & Reliable Solutions

Actions:

Write unit tests
Comment code (where appropriate; do not add useless comments)
Document APIs
Think carefully when naming classes, methods, properties etc
Use the Issue Tracker
Split work into smaller tasks
Plan all work (do not skip the design phase)
Conduct peer review
Communicate with other developers
Ensure testing processes are followed
Do not over-engineer implementations (KISS: Keep It Simple, Stupid)
Produce efficient applications (optimisation, high performance, scalability, lower hardware costs)

Measures:

Assess code readability and documentation
Unit test coverage reports
Feedback from peer review
Assess difficulty in refactoring
Assess perceived regression (in relation to scope) that is the result of not following process
Assess actual regression (in relation to scope) that is the result of not following process
Assess responsiveness and scalability of applications
Assess Issue Tracker usage and organisation skills

Objective 3: Effectively Deliver & Coordinate Projects According to Schedule

Actions:

Produce reliable time-estimates
Maintain high-visibility with a Gantt chart
Split work into comprehensible tasks
Be organised
Hold meetings
Communicate with business owners
Do not accept changes without a formal change request
Any changes to the design must be reflected in the schedule
Plan releases and patches

Measures:

Actual adherence to schedule
Perceived adherence to schedule (visibility)
Assess scope creep

Objective 4: Maintain Reliability of Deployed Systems

Actions:

Apply risk management discipline
Opt for low-risk, pragmatic solutions
Investigate alternative (“proper”) solutions regularly
Triage new bugs
Reduce key-person dependencies by sharing knowledge and using documentation
Develop maintainable solutions
Develop reliable solutions

Measures:

Assess actual regression from patches
Assess perceived regression from patches
Assess difficulty in understanding code
Assess difficulty in refactoring and making changes to code
Assess the bus factor
Assess issue resolution handling

Objective 5: Foster Positive Customer Relationships

Actions:

Manage meetings effectively
Communicate with customers via a Technical Portal
Triage issues and maintain communication

Measures:

Customer feedback

Objective 6: Self Development

Actions:

Seek feedback from other developers
Approach new roles within the team
Further your own understanding
Raise questions and/or suggestions at weekly meetings
Demonstrate initiative (improve process, etc)

Measures:

Peer review
Team meetings
Assess initiatives undertaken
Assess courses undertaken that were relevant to the business

Objective 7: OH&S

Actions:

Use ergonomic human interface devices to avoid RSI
Maintain a healthy posture
Take frequent breaks away from the computer (split up your work day)
Avoid exposure to noisy hardware
Avoid eye strain

Spidering 1012: How to transform malformed HTML into easy to use XML (XHTML)

Step 1: Use HTML Tidy to transform the HTML document into XHTML:

tidy -asxhtml < bad.html > good.html

Tidy sometimes fails on bad data (such as binary data embedded in the page). No worries: this is where you write a script that removes whatever HTML Tidy chokes on, by replacing the bad data with an empty string. Regular expressions are useful where the string to replace varies. Then pipe the cleaned document into Tidy as usual!
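A minimal sketch of such a clean-up script follows. It assumes that "bad data" means control bytes other than tab, newline and carriage return; your site may need a different pattern. The function names are hypothetical, and tidy_to_xhtml requires HTML Tidy to be installed and on your PATH.

```python
import re
import subprocess

# Assumption: "bad data" means control bytes other than tab, newline
# and carriage return. Adjust the pattern to whatever Tidy chokes on.
BAD_BYTES = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")

def strip_bad_bytes(raw):
    """Replace every byte matched by BAD_BYTES with an empty string."""
    return BAD_BYTES.sub(b"", raw)

def tidy_to_xhtml(raw):
    """Pipe the cleaned HTML through `tidy -asxhtml` and return its output."""
    result = subprocess.run(
        ["tidy", "-asxhtml"],
        input=strip_bad_bytes(raw),
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    # Tidy exits with 1 on warnings and 2 on errors, so a non-zero
    # exit status is not treated as fatal here.
    return result.stdout

print(strip_bad_bytes(b"<p>Hello\x00 \x07World</p>"))  # b'<p>Hello World</p>'
```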

Now we have nice clean XHTML that we can parse with an XML parser and manipulate very easily with DOM and XPath! Enjoy!
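For example, here is a sketch that pulls the title and all link targets out of an XHTML document with Python's standard ElementTree module. The sample document is made up to stand in for Tidy's output. Note that ElementTree supports only a subset of XPath, and that XHTML elements live in a namespace that must be named in every path.

```python
import xml.etree.ElementTree as ET

# XHTML elements are placed in this namespace.
NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical sample standing in for Tidy output.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Example</title></head>
<body>
<p><a href="/about">About</a> and <a href="/contact">Contact</a></p>
</body>
</html>"""

root = ET.fromstring(XHTML)
# Path relative to the root <html> element.
title = root.find("x:head/x:title", NS).text
# Every <a> element, at any depth, that has an href attribute.
hrefs = [a.get("href") for a in root.findall(".//x:a[@href]", NS)]
print(title, hrefs)
```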

Spidering 1011: How to fetch an entire Web-site with wget

"GNU Wget is a free utility for non-interactive download of files from the Web. It supports HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies."

Basic wget usage is to fetch a single URL:

wget http://example.org/

But we want to spider an entire site recursively! The manual is the most useful reference if you are comfortable reading man pages:

man wget

You will want to use the --recursive and --level options.

Some HTTP daemons block strange user agents. You can masquerade as an ordinary browser with --user-agent.

Some Web scripts block requests that do not have a referrer (you must click a link to the URL and not access it directly). You can pretend that you were referred from a page with --referer (note the spelling, which follows the HTTP header name).

Other useful options for recursive spidering:

--accept/--reject: Specify comma-separated lists of file name suffixes or patterns to accept or reject.
--domains/--exclude-domains: Set domains to be followed or excluded.
--follow-tags/--ignore-tags: Wget has an internal table of HTML tag / attribute pairs that it considers when looking for linked documents during a recursive retrieval. If a user wants only a subset of those tags to be considered, however, he or she should specify such tags in a comma-separated list with this option.
--span-hosts: Enable spanning across hosts when doing recursive retrieving.
--no-parent: Do not ever ascend to the parent directory when retrieving recursively.
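Putting several of these options together, the sketch below assembles a wget command line from Python without running it. The build_wget_command helper and its default values are hypothetical; the options themselves are the ones described above (wget spells the referrer option --referer).

```python
import shlex

def build_wget_command(start_url, depth=2, user_agent=None, referer=None, domains=None):
    """Return a wget argument list for a recursive fetch; nothing is executed."""
    command = ["wget", "--recursive", f"--level={depth}", "--no-parent"]
    if user_agent:
        command.append(f"--user-agent={user_agent}")
    if referer:
        command.append(f"--referer={referer}")
    if domains:
        command.append("--domains=" + ",".join(domains))
    command.append(start_url)
    return command

command = build_wget_command(
    "http://example.org/",
    depth=3,
    user_agent="Mozilla/5.0",
    referer="http://example.org/",
    domains=["example.org"],
)
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(command))
```

To run the fetch, pass the list to subprocess.run. Building it as a list avoids shell quoting problems when a user agent string contains spaces.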

Of course, there are many more options that will be useful, such as --include-directories and many more!

Lynx (a text-based Web browser) is also a useful tool.
