<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-6862015143155093240</id><updated>2012-02-16T20:51:17.937+10:30</updated><category term='voyage linux'/><category term='xml'/><category term='router'/><category term='business'/><category term='tcpdump'/><category term='dom'/><category term='javascript'/><category term='html tidy'/><category term='perl'/><category term='tutorial'/><category term='firebug'/><category term='voip router'/><category term='cutecom'/><category term='crawling'/><category term='wireshark'/><category term='http'/><category term='regex'/><category term='user-agent'/><category term='asterisk'/><category term='opensips'/><category term='elinks'/><category term='recaptcha'/><category term='html'/><category term='xpath'/><category term='wrap'/><category term='debian'/><category term='yawarra'/><category term='xhtml'/><category term='slashdot'/><category term='parser'/><category term='embedded devices'/><category term='spidering'/><category term='wget'/><category term='database'/><category term='google'/><title type='text'>Lessons in Web Spidering</title><subtitle type='html'>Tutorials about Data Mining techniques for hackers.</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://spidering-lessons.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6862015143155093240/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://spidering-lessons.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Damien 
Bezborodov</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://2.bp.blogspot.com/_470YrbjRO00/SkIcrDkRZzI/AAAAAAAAAAo/faARWQek850/S220/p6240078.jpg'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>9</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-6862015143155093240.post-2771682439498837235</id><published>2009-06-27T15:22:00.009+09:30</published><updated>2009-07-02T23:32:24.288+09:30</updated><category scheme='http://www.blogger.com/atom/ns#' term='voip router'/><category scheme='http://www.blogger.com/atom/ns#' term='yawarra'/><category scheme='http://www.blogger.com/atom/ns#' term='cutecom'/><category scheme='http://www.blogger.com/atom/ns#' term='opensips'/><category scheme='http://www.blogger.com/atom/ns#' term='router'/><category scheme='http://www.blogger.com/atom/ns#' term='wrap'/><category scheme='http://www.blogger.com/atom/ns#' term='voyage linux'/><category scheme='http://www.blogger.com/atom/ns#' term='asterisk'/><category scheme='http://www.blogger.com/atom/ns#' term='embedded devices'/><title type='text'>(Off Topic) My Shiny Tiny WRAP Firewall Running Voyage Linux</title><content type='html'>So, I purchased a &lt;a href="http://www.pcengines.ch/wrap.htm"&gt;WRAP (Wireless Router Application Platform)&lt;/a&gt; from &lt;a href="http://www.yawarra.com.au/"&gt;Yawarra&lt;/a&gt;. It can be used as a firewall or a wireless router. 
The advantage with Yawarra is that they give you a nice chassis to work with.&lt;br /&gt;&lt;br /&gt;Other bits I needed:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;A 1GB CompactFlash card for the Operating System.&lt;/li&gt;&lt;li&gt;An all-in-one card reader/writer.&lt;/li&gt;&lt;li&gt;A null modem serial cable, for connecting to the WRAP.&lt;/li&gt;&lt;li&gt;A USB to RS232 Converter, because most desktop motherboards do not have serial ports these days.&lt;/li&gt;&lt;/ul&gt;Next step, installing &lt;a href="http://linux.voyage.hk/"&gt;Voyage Linux&lt;/a&gt;, which is based on Debian. Very easy to follow the README.&lt;br /&gt;&lt;br /&gt;Victory! Below is a screenshot using &lt;a href="http://cutecom.sourceforge.net/"&gt;CuteCom&lt;/a&gt; to connect to the router via a serial cable. Notice that I am using &lt;code&gt;/dev/ttyUSB0&lt;/code&gt; because I am using a USB to RS232 Converter.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_470YrbjRO00/SkW2mMtGGJI/AAAAAAAAABY/sd8eGEzr5zw/s1600-h/victory.png"&gt;&lt;img style="cursor: pointer; width: 320px; height: 306px;" src="http://2.bp.blogspot.com/_470YrbjRO00/SkW2mMtGGJI/AAAAAAAAABY/sd8eGEzr5zw/s320/victory.png" alt="" id="BLOGGER_PHOTO_ID_5351884499561355410" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;A picture of the WRAP:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_470YrbjRO00/SkW3JcOmWRI/AAAAAAAAABg/NqBFi8qoy-I/s1600-h/p6270081.jpg"&gt;&lt;img style="cursor: pointer; width: 320px; height: 240px;" src="http://3.bp.blogspot.com/_470YrbjRO00/SkW3JcOmWRI/AAAAAAAAABg/NqBFi8qoy-I/s320/p6270081.jpg" alt="" id="BLOGGER_PHOTO_ID_5351885105023834386" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Planned usage:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Install &lt;strike&gt;&lt;a href="http://www.asterisk.org/"&gt;Asterisk&lt;/a&gt;&lt;/strike&gt; rather &lt;a 
href="http://www.opensips.org/"&gt;OpenSIPS&lt;/a&gt; and use it as a VoIP router.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6862015143155093240-2771682439498837235?l=spidering-lessons.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://spidering-lessons.blogspot.com/feeds/2771682439498837235/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://spidering-lessons.blogspot.com/2009/06/off-topic-my-shiny-tiny-wrap-firewall.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6862015143155093240/posts/default/2771682439498837235'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6862015143155093240/posts/default/2771682439498837235'/><link rel='alternate' type='text/html' href='http://spidering-lessons.blogspot.com/2009/06/off-topic-my-shiny-tiny-wrap-firewall.html' title='(Off Topic) My Shiny Tiny WRAP Firewall Running Voyage Linux'/><author><name>Damien Bezborodov</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://2.bp.blogspot.com/_470YrbjRO00/SkIcrDkRZzI/AAAAAAAAAAo/faARWQek850/S220/p6240078.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_470YrbjRO00/SkW2mMtGGJI/AAAAAAAAABY/sd8eGEzr5zw/s72-c/victory.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6862015143155093240.post-8276223240508955900</id><published>2009-06-25T12:24:00.010+09:30</published><updated>2009-06-25T12:58:04.125+09:30</updated><category scheme='http://www.blogger.com/atom/ns#' term='elinks'/><title type='text'>How to compile ELinks from source on Debian or Ubuntu</title><content type='html'>&lt;a 
href="http://elinks.cz/"&gt;ELinks&lt;/a&gt; is a rather handy text-based Web browser.&lt;br /&gt;&lt;br /&gt;First, install the &lt;a href="http://www.gnu.org/software/gnutls/"&gt;GnuTLS&lt;/a&gt; development files to enable SSL support:&lt;br /&gt;&lt;pre&gt;aptitude install libgnutls-dev&lt;/pre&gt;Then, compile and install ELinks:&lt;pre&gt;wget http://elinks.or.cz/download/elinks-0.12pre4.tar.bz2&lt;br /&gt;tar -xjvf elinks-0.12pre4.tar.bz2&lt;br /&gt;cd elinks-0.12pre4&lt;br /&gt;./configure --with-gnutls&lt;br /&gt;make&lt;br /&gt;make install&lt;/pre&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_470YrbjRO00/SkLp_swf-4I/AAAAAAAAABQ/jYyaCq0UXlc/s1600-h/elinks_homepage.png"&gt;&lt;img style="cursor: pointer; width: 320px; height: 277px;" src="http://2.bp.blogspot.com/_470YrbjRO00/SkLp_swf-4I/AAAAAAAAABQ/jYyaCq0UXlc/s320/elinks_homepage.png" alt="" id="BLOGGER_PHOTO_ID_5351096587825183618" border="0" /&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6862015143155093240-8276223240508955900?l=spidering-lessons.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://spidering-lessons.blogspot.com/feeds/8276223240508955900/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://spidering-lessons.blogspot.com/2009/06/how-to-compile-elinks-from-source-on.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6862015143155093240/posts/default/8276223240508955900'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6862015143155093240/posts/default/8276223240508955900'/><link rel='alternate' type='text/html' href='http://spidering-lessons.blogspot.com/2009/06/how-to-compile-elinks-from-source-on.html' title='How to compile ELinks from source on 
Debian or Ubuntu'/><author><name>Damien Bezborodov</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://2.bp.blogspot.com/_470YrbjRO00/SkIcrDkRZzI/AAAAAAAAAAo/faARWQek850/S220/p6240078.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_470YrbjRO00/SkLp_swf-4I/AAAAAAAAABQ/jYyaCq0UXlc/s72-c/elinks_homepage.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6862015143155093240.post-6437199740209226557</id><published>2009-06-24T21:25:00.000+09:30</published><updated>2009-06-24T22:59:33.233+09:30</updated><category scheme='http://www.blogger.com/atom/ns#' term='spidering'/><category scheme='http://www.blogger.com/atom/ns#' term='tutorial'/><category scheme='http://www.blogger.com/atom/ns#' term='wget'/><category scheme='http://www.blogger.com/atom/ns#' term='crawling'/><category scheme='http://www.blogger.com/atom/ns#' term='database'/><title type='text'>Spidering 104: How to crawl a Web-site into a MySQL database (coming soon)</title><content type='html'>You were introduced to wget in lesson 1011. Now we are going to put the fetched documents into a database, along with metadata such as the URL, retrieval date and outgoing links. This provides a simple point of integration with other parts of your application suite and an indexed table in which to look up your data. 
The database can search an index much faster than it can scan the whole table, and frequently used parts of the index are typically cached in Random Access Memory for quick retrieval.&lt;br /&gt;&lt;br /&gt;Coming soon...&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6862015143155093240-6437199740209226557?l=spidering-lessons.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://spidering-lessons.blogspot.com/feeds/6437199740209226557/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://spidering-lessons.blogspot.com/2009/06/spidering-104-how-to-crawl-web-site.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6862015143155093240/posts/default/6437199740209226557'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6862015143155093240/posts/default/6437199740209226557'/><link rel='alternate' type='text/html' href='http://spidering-lessons.blogspot.com/2009/06/spidering-104-how-to-crawl-web-site.html' title='Spidering 104: How to crawl a Web-site into a MySQL database (coming soon)'/><author><name>Damien Bezborodov</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://2.bp.blogspot.com/_470YrbjRO00/SkIcrDkRZzI/AAAAAAAAAAo/faARWQek850/S220/p6240078.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6862015143155093240.post-7355271294872706486</id><published>2009-06-24T20:47:00.000+09:30</published><updated>2009-06-24T22:27:34.262+09:30</updated><category scheme='http://www.blogger.com/atom/ns#' term='http'/><category scheme='http://www.blogger.com/atom/ns#' term='spidering'/><category scheme='http://www.blogger.com/atom/ns#' term='tutorial'/><category scheme='http://www.blogger.com/atom/ns#' term='user-agent'/><category scheme='http://www.blogger.com/atom/ns#' term='tcpdump'/><category 
scheme='http://www.blogger.com/atom/ns#' term='crawling'/><category scheme='http://www.blogger.com/atom/ns#' term='wireshark'/><category scheme='http://www.blogger.com/atom/ns#' term='firebug'/><title type='text'>Spidering 103: How to analyse HTTP traffic</title><content type='html'>Analysing HTTP traffic is useful for discovering the personality of Web-sites.&lt;br /&gt;&lt;br /&gt;You can analyse HTTP traffic with the &lt;a href="http://getfirebug.com/"&gt;Firebug&lt;/a&gt; extension for &lt;a href="http://www.mozilla.com/firefox/"&gt;Mozilla Firefox&lt;/a&gt;. If you need to pretend to be using Internet Explorer for some reason, you can use a &lt;a href="https://addons.mozilla.org/firefox/addon/59"&gt;User Agent Switcher&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.wireshark.org/"&gt;Wireshark&lt;/a&gt; and &lt;a href="http://www.tcpdump.org/"&gt;tcpdump&lt;/a&gt; are also useful, but may be annoying when trying to analyse HTTPS and do not provide integration with your Web browser.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6862015143155093240-7355271294872706486?l=spidering-lessons.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://spidering-lessons.blogspot.com/feeds/7355271294872706486/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://spidering-lessons.blogspot.com/2009/06/spidering-103-how-to-analyse-http.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6862015143155093240/posts/default/7355271294872706486'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6862015143155093240/posts/default/7355271294872706486'/><link rel='alternate' type='text/html' href='http://spidering-lessons.blogspot.com/2009/06/spidering-103-how-to-analyse-http.html' title='Spidering 103: How to analyse 
HTTP traffic'/><author><name>Damien Bezborodov</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://2.bp.blogspot.com/_470YrbjRO00/SkIcrDkRZzI/AAAAAAAAAAo/faARWQek850/S220/p6240078.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6862015143155093240.post-3108969476267878491</id><published>2009-06-24T20:26:00.001+09:30</published><updated>2009-06-27T15:42:34.021+09:30</updated><category scheme='http://www.blogger.com/atom/ns#' term='business'/><title type='text'>(Off Topic) How to measure the performance of your application development team</title><content type='html'>If you ever end up working for a company that implements &lt;a href="http://en.wikipedia.org/wiki/Key_performance_indicators"&gt;Key Performance Indicators&lt;/a&gt;, the below objectives, actions and KPIs (measures) may be useful for defining your role as an Analyst Programmer:&lt;br /&gt;&lt;h4&gt;Objective 1: Design Simple Solutions That Meet Requirements&lt;/h4&gt;&lt;h5&gt;Actions:&lt;/h5&gt;&lt;ul&gt;&lt;li&gt;Keep the design simple&lt;/li&gt;&lt;li&gt;Use known &lt;a href="http://books.google.com/books?id=5iV5HgAACAAJ"&gt;design patterns&lt;/a&gt; where appropriate&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Split design into comprehensive components&lt;/li&gt;&lt;li&gt;Focus on deliverables  during design phase&lt;/li&gt;&lt;li&gt;Maintain a high degree of discussion between developers&lt;/li&gt;&lt;li&gt;Maintain communication with  business owners&lt;/li&gt;&lt;li&gt;Maintain communication with &lt;a href="http://en.wikipedia.org/wiki/Domain_expert"&gt;domain  experts&lt;/a&gt;&lt;/li&gt;&lt;li&gt;Seek feedback from business owners  early&lt;/li&gt;&lt;li&gt;Use mock-ups where appropriate to  convey concepts clearly&lt;/li&gt;&lt;/ul&gt;&lt;h5&gt;Measures:&lt;/h5&gt;&lt;ul&gt;&lt;li&gt;Perceived visibility to design process&lt;/li&gt;&lt;li&gt;Discuss effectiveness of design in  
post-release meeting&lt;/li&gt;&lt;/ul&gt;&lt;h4&gt;Objective 2: Efficiently Implement Maintainable &amp;amp; Reliable Solutions&lt;/h4&gt;&lt;h5&gt;Actions:&lt;/h5&gt;&lt;ul&gt;&lt;li&gt;Write &lt;a href="http://en.wikipedia.org/wiki/Unit_testing"&gt;unit tests&lt;/a&gt;&lt;/li&gt;&lt;li&gt;Comment code (where appropriate; do not add useless comments)&lt;/li&gt;&lt;li&gt;Document APIs&lt;/li&gt;&lt;li&gt;Think carefully when naming  classes, methods, properties etc&lt;/li&gt;&lt;li&gt;Use the &lt;a href="http://trac.edgewall.org/"&gt;Issue Tracker&lt;/a&gt;&lt;/li&gt;&lt;li&gt;Split work into smaller tasks&lt;/li&gt;&lt;li&gt;Plan all work (do not skip the  design phase)&lt;/li&gt;&lt;li&gt;Conduct peer review&lt;/li&gt;&lt;li&gt;Communicate with other developers&lt;/li&gt;&lt;li&gt;Ensure testing processes are  followed&lt;/li&gt;&lt;li&gt;Do not over-engineer  implementations (KISS: Keep It Simple, Stupid)&lt;/li&gt;&lt;li&gt;Produce efficient applications  (optimisation, high performance, scalability, lower hardware costs)&lt;/li&gt;&lt;/ul&gt;&lt;h5&gt;Measures:&lt;/h5&gt;&lt;ul&gt;&lt;li&gt;Assess code readability and  documentation&lt;/li&gt;&lt;li&gt;Unit test coverage reports&lt;/li&gt;&lt;li&gt;Feedback from peer review&lt;/li&gt;&lt;li&gt;Assess difficulty in &lt;a href="http://books.google.com/books?id=1MsETFPD3I0C"&gt;refactoring&lt;/a&gt;&lt;/li&gt;&lt;li&gt;Assess perceived &lt;a href="http://en.wikipedia.org/wiki/Software_regression"&gt;regression&lt;/a&gt; (in  relation to scope) that is the result of not following process&lt;/li&gt;&lt;li&gt;Assess actual regression (in  relation to scope) that is the result of not following process&lt;/li&gt;&lt;li&gt;Assess responsiveness and  scalability of applications&lt;/li&gt;&lt;li&gt;Assess Issue Tracker usage and organisation  skills&lt;/li&gt;&lt;/ul&gt;&lt;h4&gt;Objective 3: Effectively Deliver &amp;amp; Coordinate Projects According to Schedule&lt;/h4&gt;&lt;h5&gt;Actions:&lt;/h5&gt; 
&lt;ul&gt;&lt;li&gt;Produce reliable time-estimates&lt;/li&gt;&lt;li&gt;Maintain high-visibility with a  &lt;a href="http://en.wikipedia.org/wiki/Gantt_chart"&gt;Gantt chart&lt;/a&gt;&lt;/li&gt;&lt;li&gt;Split work into comprehensive  tasks&lt;/li&gt;&lt;li&gt;Be organised  &lt;/li&gt;&lt;li&gt;Hold meetings  &lt;/li&gt;&lt;li&gt;Communicate with business owners  &lt;/li&gt;&lt;li&gt;Do not accept changes without a  formal change request  &lt;/li&gt;&lt;li&gt;Any changes to the design must be  reflected in the schedule  &lt;/li&gt;&lt;li&gt;Plan releases and patches &lt;/li&gt;&lt;/ul&gt; &lt;h5&gt;Measures:&lt;/h5&gt; &lt;ul&gt;&lt;li&gt;Actual adherence to schedule  &lt;/li&gt;&lt;li&gt;Perceived adherence to schedule  (visibility)  &lt;/li&gt;&lt;li&gt;Assess &lt;a href="http://en.wikipedia.org/wiki/Scope_creep"&gt;scope creep&lt;/a&gt; &lt;/li&gt;&lt;/ul&gt; &lt;h4&gt;Objective 4: Maintain Reliability of Deployed Systems&lt;/h4&gt; &lt;h5&gt;Actions:&lt;/h5&gt; &lt;ul&gt;&lt;li&gt;Apply risk management discipline  &lt;/li&gt;&lt;li&gt;Opt for low-risk, pragmatic  solutions  &lt;/li&gt;&lt;li&gt;Investigate alternative (“proper”)  solutions regularly  &lt;/li&gt;&lt;li&gt;&lt;a href="http://en.wikipedia.org/wiki/Bug_triage"&gt;Triage&lt;/a&gt; new  bugs  &lt;/li&gt;&lt;li&gt;Reduce key-person dependencies by  sharing knowledge and using documentation  &lt;/li&gt;&lt;li&gt;Develop maintainable solutions  &lt;/li&gt;&lt;li&gt;Develop reliable solutions &lt;/li&gt;&lt;/ul&gt; &lt;h5&gt;Measures:&lt;/h5&gt; &lt;ul&gt;&lt;li&gt;Assess actual regression from  &lt;a href="http://en.wikipedia.org/wiki/Patch_%28computing%29"&gt;patches&lt;/a&gt;  &lt;/li&gt;&lt;li&gt;Assess perceived regression from  patches  &lt;/li&gt;&lt;li&gt;Assess difficulty in understanding  code  &lt;/li&gt;&lt;li&gt;Assess difficulty in refactoring  and making changes to code  &lt;/li&gt;&lt;li&gt;Assess the &lt;a href="http://en.wikipedia.org/wiki/Bus_factor"&gt;Bus factor&lt;/a&gt;.&lt;br 
/&gt;&lt;/li&gt;&lt;li&gt;Assess issue resolution handling &lt;/li&gt;&lt;/ul&gt; &lt;h4&gt;Objective 5: Foster Positive Customer Relationships&lt;/h4&gt; &lt;h5&gt;Actions:&lt;/h5&gt; &lt;ul&gt;&lt;li&gt;Manage meetings effectively  &lt;/li&gt;&lt;li&gt;Communicate with customers via a Technical &lt;a href="http://en.wikipedia.org/wiki/Web_portal"&gt;Portal&lt;/a&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Triage issues and maintain  communication &lt;/li&gt;&lt;/ul&gt; &lt;h5&gt;Measures:&lt;/h5&gt; &lt;ul&gt;&lt;li&gt;Customer feedback &lt;/li&gt;&lt;/ul&gt; &lt;h4&gt;Objective 6: Self Development&lt;/h4&gt; &lt;h5&gt;Actions:&lt;/h5&gt; &lt;ul&gt;&lt;li&gt;Seek feedback from other  developers  &lt;/li&gt;&lt;li&gt;Approach new roles within the team  &lt;/li&gt;&lt;li&gt;Further your own understanding  &lt;/li&gt;&lt;li&gt;Raise questions and/or suggestions  at weekly meetings  &lt;/li&gt;&lt;li&gt;Demonstrate initiative (improve  process, etc) &lt;/li&gt;&lt;/ul&gt; &lt;h5&gt;Measures:&lt;/h5&gt; &lt;ul&gt;&lt;li&gt;Peer review  &lt;/li&gt;&lt;li&gt;Team meetings  &lt;/li&gt;&lt;li&gt;Assess initiatives undertaken  &lt;/li&gt;&lt;li&gt;Assess courses undertaken that  were relevant to the business &lt;/li&gt;&lt;/ul&gt; &lt;h4&gt;Objective 7: OH&amp;amp;S&lt;/h4&gt; &lt;h5&gt;Actions:&lt;/h5&gt; &lt;ul&gt;&lt;li&gt;Use ergonomic human interface  devices to avoid &lt;a href="http://en.wikipedia.org/wiki/Repetitive_strain_injury"&gt;RSI&lt;/a&gt;  &lt;/li&gt;&lt;li&gt;Maintain a healthy posture  &lt;/li&gt;&lt;li&gt;Take frequent breaks away from the  computer (split up your work day)  &lt;/li&gt;&lt;li&gt;Avoid exposure to noisy hardware  &lt;/li&gt;&lt;li&gt;Avoid eye strain &lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6862015143155093240-3108969476267878491?l=spidering-lessons.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' 
href='http://spidering-lessons.blogspot.com/feeds/3108969476267878491/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://spidering-lessons.blogspot.com/2009/06/how-to-measure-performance-of-your-it.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6862015143155093240/posts/default/3108969476267878491'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6862015143155093240/posts/default/3108969476267878491'/><link rel='alternate' type='text/html' href='http://spidering-lessons.blogspot.com/2009/06/how-to-measure-performance-of-your-it.html' title='(Off Topic) How to measure the performance of your application development team'/><author><name>Damien Bezborodov</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://2.bp.blogspot.com/_470YrbjRO00/SkIcrDkRZzI/AAAAAAAAAAo/faARWQek850/S220/p6240078.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6862015143155093240.post-3840738943777096702</id><published>2009-06-24T20:14:00.001+09:30</published><updated>2009-06-24T22:29:00.600+09:30</updated><category scheme='http://www.blogger.com/atom/ns#' term='xhtml'/><category scheme='http://www.blogger.com/atom/ns#' term='parser'/><category scheme='http://www.blogger.com/atom/ns#' term='html tidy'/><category scheme='http://www.blogger.com/atom/ns#' term='dom'/><category scheme='http://www.blogger.com/atom/ns#' term='spidering'/><category scheme='http://www.blogger.com/atom/ns#' term='xpath'/><category scheme='http://www.blogger.com/atom/ns#' term='xml'/><category scheme='http://www.blogger.com/atom/ns#' term='tutorial'/><category scheme='http://www.blogger.com/atom/ns#' term='regex'/><category scheme='http://www.blogger.com/atom/ns#' term='crawling'/><category scheme='http://www.blogger.com/atom/ns#' term='html'/><title 
type='text'>Spidering 1012: How to transform malformed HTML into easy to use XML (XHTML)</title><content type='html'>Step 1: use &lt;a href="http://tidy.sourceforge.net/"&gt;HTML Tidy&lt;/a&gt; to transform the HTML document into XHTML:&lt;br /&gt;&lt;pre&gt;tidy -asxhtml &amp;lt; bad.html &gt; good.html&lt;/pre&gt;Now, Tidy sometimes fails on bad data (such as binary code)! No worries: this is where you manually write a script that removes any bad data that HTML Tidy chokes on. You will need to replace the bad data with an empty string. You may find &lt;a href="http://perldoc.perl.org/perlre.html"&gt;regular expressions&lt;/a&gt; useful where the string to replace varies. Then, pipe the result into Tidy as usual!&lt;br /&gt;&lt;br /&gt;Now we have nice clean XHTML that we can parse with an XML parser and manipulate very easily with DOM and XPath! Enjoy!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6862015143155093240-3840738943777096702?l=spidering-lessons.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://spidering-lessons.blogspot.com/feeds/3840738943777096702/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://spidering-lessons.blogspot.com/2009/06/spidering-1012-how-to-tranform-shitty.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6862015143155093240/posts/default/3840738943777096702'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6862015143155093240/posts/default/3840738943777096702'/><link rel='alternate' type='text/html' href='http://spidering-lessons.blogspot.com/2009/06/spidering-1012-how-to-tranform-shitty.html' title='Spidering 1012: How to transform malformed HTML into easy to use XML (XHTML)'/><author><name>Damien Bezborodov</name><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://2.bp.blogspot.com/_470YrbjRO00/SkIcrDkRZzI/AAAAAAAAAAo/faARWQek850/S220/p6240078.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6862015143155093240.post-5702134370424727904</id><published>2009-06-24T19:54:00.001+09:30</published><updated>2009-06-25T13:05:16.072+09:30</updated><category scheme='http://www.blogger.com/atom/ns#' term='spidering'/><category scheme='http://www.blogger.com/atom/ns#' term='tutorial'/><category scheme='http://www.blogger.com/atom/ns#' term='wget'/><category scheme='http://www.blogger.com/atom/ns#' term='crawling'/><title type='text'>Spidering 1011: How to fetch an entire Web-site with wget</title><content type='html'>"&lt;a href="http://www.gnu.org/software/wget/"&gt;GNU Wget&lt;/a&gt; is a free utility for non-interactive download of files from the Web.  It supports HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies."&lt;br /&gt;&lt;br /&gt;Basic wget usage is to fetch a single URL:&lt;br /&gt;&lt;pre&gt;wget http://example.org/&lt;/pre&gt;But, we want to spider an entire site &lt;span style="font-style: italic;"&gt;recursively&lt;/span&gt;! The manual is most useful for anybody who is competent:&lt;br /&gt;&lt;pre&gt;man wget&lt;/pre&gt;You will want to use the &lt;span style="font-weight: bold;"&gt;--recursive&lt;/span&gt; and &lt;span style="font-weight: bold;"&gt;--level&lt;/span&gt; options.&lt;br /&gt;&lt;br /&gt;Some HTTP daemons block strange user agents. You can masquerade as an ordinary browser with &lt;span style="font-weight: bold;"&gt;--user-agent&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;Some Web scripts block requests that do not have a referrer (you must click a link to the URL and not access it directly). 
You can pretend that you were referred from a page with &lt;span style="font-weight: bold;"&gt;--referer&lt;/span&gt; (wget spells the option the same way as the HTTP header).&lt;br /&gt;&lt;br /&gt;Other useful options for recursive spidering:&lt;br /&gt;&lt;ul&gt;&lt;li&gt; &lt;span style="font-weight: bold;"&gt;--accept/--reject:&lt;/span&gt; Specify comma-separated lists of file name suffixes or patterns to accept or reject.&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;--domains/--exclude-domains&lt;/span&gt;: Set the domains to be followed or excluded.&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;--follow-tags/--ignore-tags:&lt;/span&gt; Wget has an internal table of HTML tag / attribute pairs that it considers when looking for linked documents during a recursive retrieval. If a user wants only a subset of those tags to be considered, however, he or she should specify such tags in a comma-separated list with this option.&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;--span-hosts:&lt;/span&gt; Enable spanning across hosts when doing recursive retrieving.&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;--no-parent:&lt;/span&gt; Do not ever ascend to the parent directory when retrieving recursively.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;Of course, there are many more useful options, such as &lt;span style="font-weight: bold;"&gt;--include-directories&lt;/span&gt;!&lt;br /&gt;&lt;br /&gt;&lt;a href="http://lynx.isc.org/"&gt;Lynx (a text-based Web browser)&lt;/a&gt; is also a useful tool.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6862015143155093240-5702134370424727904?l=spidering-lessons.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://spidering-lessons.blogspot.com/feeds/5702134370424727904/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://spidering-lessons.blogspot.com/2009/06/spidering-1011-how-to-fetch-entire-web.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6862015143155093240/posts/default/5702134370424727904'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6862015143155093240/posts/default/5702134370424727904'/><link rel='alternate' type='text/html' href='http://spidering-lessons.blogspot.com/2009/06/spidering-1011-how-to-fetch-entire-web.html' title='Spidering 1011: How to fetch an entire Web-site with wget'/><author><name>Damien Bezborodov</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://2.bp.blogspot.com/_470YrbjRO00/SkIcrDkRZzI/AAAAAAAAAAo/faARWQek850/S220/p6240078.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6862015143155093240.post-4784694327526565086</id><published>2009-06-24T19:10:00.000+09:30</published><updated>2009-06-24T22:33:53.185+09:30</updated><category scheme='http://www.blogger.com/atom/ns#' term='parser'/><category scheme='http://www.blogger.com/atom/ns#' term='dom'/><category scheme='http://www.blogger.com/atom/ns#' term='debian'/><category scheme='http://www.blogger.com/atom/ns#' term='spidering'/><category scheme='http://www.blogger.com/atom/ns#' term='regex'/><category scheme='http://www.blogger.com/atom/ns#' term='xpath'/><category scheme='http://www.blogger.com/atom/ns#' term='wget'/><category scheme='http://www.blogger.com/atom/ns#' term='crawling'/><category scheme='http://www.blogger.com/atom/ns#' term='perl'/><category scheme='http://www.blogger.com/atom/ns#' term='html tidy'/><category scheme='http://www.blogger.com/atom/ns#' term='tutorial'/><category scheme='http://www.blogger.com/atom/ns#' term='xml'/><category scheme='http://www.blogger.com/atom/ns#' term='database'/><title type='text'>Spidering 101: Meet your 
toolbox</title><content type='html'>Useful tools for spidering:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.debian.org/"&gt;A decent operating system&lt;/a&gt;.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.mysql.com/"&gt;A database&lt;/a&gt; for storing your data and organising it into processing queues.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://tidy.sourceforge.net/"&gt;HTML Tidy&lt;/a&gt; for parsing malformed HTML into well-formed XHTML.&lt;/li&gt;&lt;li&gt;A &lt;a href="http://www.perl.org/"&gt;scripting language that supports regular expressions&lt;/a&gt; for doing dirty work.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;An &lt;a href="http://search.cpan.org/%7Eenno/libxml-enno/"&gt;XML parser&lt;/a&gt; that supports &lt;a href="http://search.cpan.org/%7Etjmather/XML-DOM/"&gt;DOM&lt;/a&gt; and &lt;a href="http://search.cpan.org/%7Emsergeant/XML-XPath/"&gt;XPath&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;A decent &lt;a href="http://www.gnu.org/software/wget/"&gt;WWW automation client&lt;/a&gt; for fetching pages and following links.&lt;/li&gt;&lt;li&gt;A brain.&lt;/li&gt;&lt;li&gt;A purpose (of good intent).&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6862015143155093240-4784694327526565086?l=spidering-lessons.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://spidering-lessons.blogspot.com/feeds/4784694327526565086/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://spidering-lessons.blogspot.com/2009/06/spider-101-meet-your-toolbox.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6862015143155093240/posts/default/4784694327526565086'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6862015143155093240/posts/default/4784694327526565086'/><link rel='alternate' type='text/html' href='http://spidering-lessons.blogspot.com/2009/06/spider-101-meet-your-toolbox.html' title='Spidering 101: Meet your toolbox'/><author><name>Damien Bezborodov</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' 
src='http://2.bp.blogspot.com/_470YrbjRO00/SkIcrDkRZzI/AAAAAAAAAAo/faARWQek850/S220/p6240078.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6862015143155093240.post-370593280452801803</id><published>2009-06-24T18:24:00.001+09:30</published><updated>2009-06-25T13:30:11.795+09:30</updated><category scheme='http://www.blogger.com/atom/ns#' term='javascript'/><category scheme='http://www.blogger.com/atom/ns#' term='parser'/><category scheme='http://www.blogger.com/atom/ns#' term='dom'/><category scheme='http://www.blogger.com/atom/ns#' term='slashdot'/><category scheme='http://www.blogger.com/atom/ns#' term='spidering'/><category scheme='http://www.blogger.com/atom/ns#' term='crawling'/><category scheme='http://www.blogger.com/atom/ns#' term='perl'/><category scheme='http://www.blogger.com/atom/ns#' term='recaptcha'/><category scheme='http://www.blogger.com/atom/ns#' term='html tidy'/><category scheme='http://www.blogger.com/atom/ns#' term='tutorial'/><category scheme='http://www.blogger.com/atom/ns#' term='xml'/><category scheme='http://www.blogger.com/atom/ns#' term='google'/><category scheme='http://www.blogger.com/atom/ns#' term='html'/><title type='text'>Spidering 102: How to write a basic script to parse JavaScript-obfuscated email addresses (in under an hour)</title><content type='html'>In response to &lt;a href="http://it.slashdot.org/story/09/06/23/173229/Has-Google-Broken-JavaScript-Spam-Munging"&gt;         Slashdot IT Story | Has Google Broken JavaScript Spam Munging?&lt;/a&gt;&lt;br /&gt;&lt;p style="font-weight: bold;"&gt;&lt;span style="font-style: italic;"&gt;You should use something such as &lt;/span&gt;&lt;a style="font-style: italic;" href="http://recaptcha.net/"&gt;reCAPTCHA&lt;/a&gt;&lt;span style="font-style: italic;"&gt; if you want to be serious about protecting your email address, especially since Google is now reported to be parsing JavaScript according to the above Slashdot story!&lt;/span&gt;&lt;br 
/&gt;&lt;/p&gt;&lt;h4&gt;Step 1, Create a test file (the hour-timer starts here)&lt;/h4&gt;&lt;p&gt;So, the document is malformed HTML as usual:&lt;/p&gt;&lt;p&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_470YrbjRO00/SkHqUp_cKiI/AAAAAAAAAAM/rz-3pZfGvbs/s1600-h/jsbot_start.png"&gt;&lt;img style="cursor: pointer; width: 320px; height: 232px;" src="http://2.bp.blogspot.com/_470YrbjRO00/SkHqUp_cKiI/AAAAAAAAAAM/rz-3pZfGvbs/s320/jsbot_start.png" alt="" id="BLOGGER_PHOTO_ID_5350815472883083810" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;&lt;h4&gt;Step 2, Write the parser&lt;/h4&gt;         &lt;p&gt;We use a &lt;a href="http://search.cpan.org/%7Eclaesjac/JavaScript/"&gt;Perl extension for executing embedded JavaScript&lt;/a&gt; and &lt;a href="http://search.cpan.org/%7Eenno/libxml-enno/"&gt;XML::DOM::Parser&lt;/a&gt;.&lt;/p&gt;         &lt;pre&gt;$ cat buildEmailScannableList&lt;br /&gt;#!/usr/bin/perl -w&lt;br /&gt;use strict;&lt;br /&gt;use warnings;&lt;br /&gt;use XML::DOM;&lt;br /&gt;use IO::Handle;&lt;br /&gt;use JavaScript;&lt;br /&gt;&lt;br /&gt;my $io = IO::Handle-&gt;new;&lt;br /&gt;$io-&gt;fdopen(fileno(STDIN),"r");&lt;br /&gt;&lt;br /&gt;my @scanQueue = ();&lt;br /&gt;&lt;br /&gt;my $parser = new XML::DOM::Parser;&lt;br /&gt;my $doc = $parser-&gt;parse($io);&lt;br /&gt;&lt;br /&gt;# Get all a-&gt;href&lt;br /&gt;for my $node ($doc-&gt;getElementsByTagName("a")) {&lt;br /&gt;  my $href = $node-&gt;getAttributeNode("href");&lt;br /&gt;  push @scanQueue, $href-&gt;getValue if defined $href;&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;# Get all text content&lt;br /&gt;# ... 
this is easy, won't bother implementing&lt;br /&gt;&lt;br /&gt;# Get all evaluated javascript (this is the harder bit)&lt;br /&gt;my $rt = JavaScript::Runtime-&gt;new();&lt;br /&gt;my $cx = $rt-&gt;create_context();&lt;br /&gt;$cx-&gt;bind_function("document.write" =&gt; sub { push @scanQueue, @_; });&lt;br /&gt;for my $node ($doc-&gt;getElementsByTagName("script")) {&lt;br /&gt;  my $javascript = $node-&gt;getChildAtIndex(0)-&gt;toString(); # Not robust!&lt;br /&gt;  $cx-&gt;eval($javascript);&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;print "$_\n" for @scanQueue;&lt;br /&gt;&lt;br /&gt;$doc-&gt;dispose;&lt;br /&gt;&lt;/pre&gt;                  &lt;h4&gt;Step 3, Pipe the well-formed document into the parser (timer ends here)&lt;/h4&gt;         &lt;p&gt;We use HTML Tidy to convert the document into well-formed XHTML and pipe that into our parser:&lt;/p&gt;&lt;p&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_470YrbjRO00/SkHql4BMOlI/AAAAAAAAAAU/U_jToqD0BNQ/s1600-h/jsbot_fini.png"&gt;&lt;img style="cursor: pointer; width: 331px; height: 250px;" src="http://3.bp.blogspot.com/_470YrbjRO00/SkHql4BMOlI/AAAAAAAAAAU/U_jToqD0BNQ/s320/jsbot_fini.png" alt="" id="BLOGGER_PHOTO_ID_5350815768706300498" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;         &lt;h4&gt;Step X, integrate into your text scanning scripts with pipes or a file queue&lt;/h4&gt;         &lt;p&gt;Well, the sky is the limit! Of course, you can easily use wget to do the actual spidering of the Web-site.&lt;/p&gt;         &lt;p&gt;This technique can also be used for harvesting other types of information such as links in dynamic JavaScript menus, etc.         
So yes, this can have many legitimate uses.&lt;/p&gt;&lt;p&gt;If you know C, you might also be interested in &lt;a href="http://www.mozilla.org/js/spidermonkey/"&gt;SpiderMonkey&lt;/a&gt;.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6862015143155093240-370593280452801803?l=spidering-lessons.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://spidering-lessons.blogspot.com/feeds/370593280452801803/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://spidering-lessons.blogspot.com/2009/06/spidering-102-how-to-write-basic-script.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6862015143155093240/posts/default/370593280452801803'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6862015143155093240/posts/default/370593280452801803'/><link rel='alternate' type='text/html' href='http://spidering-lessons.blogspot.com/2009/06/spidering-102-how-to-write-basic-script.html' title='Spidering 102: How to write a basic script to parse JavaScript-obfuscated email addresses (in under an hour)'/><author><name>Damien Bezborodov</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://2.bp.blogspot.com/_470YrbjRO00/SkIcrDkRZzI/AAAAAAAAAAo/faARWQek850/S220/p6240078.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_470YrbjRO00/SkHqUp_cKiI/AAAAAAAAAAM/rz-3pZfGvbs/s72-c/jsbot_start.png' height='72' width='72'/><thr:total>0</thr:total></entry></feed>
