Wednesday, 24 June 2009

Spidering 102: How to write a basic script to parse JavaScript-obfuscated email addresses (in under an hour)

In response to Slashdot IT Story | Has Google Broken JavaScript Spam Munging?

If you are serious about protecting your email address, use something like reCAPTCHA, especially now that Google is reported to be parsing JavaScript, according to the Slashdot story above!

Step 1, Create a test file (the hour-timer starts here)

So, the document is malformed HTML as usual:
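A stand-in test file for the steps that follow could be created like this. The file name test.html, the addresses and the markup are all made up for illustration; the only assumptions are that the page is not well-formed and that one address is assembled by document.write at run time.

```shell
# Hypothetical test page (not the original one). The unquoted attribute,
# the unclosed <p> and <a> tags and the missing </html> make it malformed;
# the second address only exists after the script has run.
cat > test.html <<'EOF'
<html>
<head><title>Contact</title></head>
<body>
<p>Mail the <a href=mailto:webmaster@example.com>webmaster
<p>
<script type="text/javascript">
var user = "hidden";
var host = "example.com";
document.write(user + "@" + host);
</script>
</body>
EOF
```

A plain text search of this file finds the webmaster address but not the scripted one, which is the case the parser in Step 2 is meant to cover.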

Step 2, Write the parser

We use XML::DOM::Parser to parse the document and the JavaScript module (a Perl extension for executing embedded JavaScript) to evaluate the scripts it contains.

$ cat buildEmailScannableList
#!/usr/bin/perl -w
use strict;
use warnings;
use XML::DOM;
use IO::Handle;
use JavaScript;

my $io = IO::Handle->new;
$io->fdopen(fileno(STDIN),"r");

my @scanQueue = ();

my $parser = XML::DOM::Parser->new;
my $doc = $parser->parse($io);

# Get all a->href
for my $node ($doc->getElementsByTagName("a")) {
    my $href = $node->getAttributeNode("href");
    push @scanQueue, $href->getValue if defined $href;
}

# Get all text content
# ... this is easy, won't bother implementing

# Get all evaluated javascript (this is the harder bit)
my $rt = JavaScript::Runtime->new();
my $cx = $rt->create_context();
$cx->bind_function("document.write" => sub { push @scanQueue, @_; });
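for my $node ($doc->getElementsByTagName("script")) {
    # Not robust: only the first child node of each <script> is evaluated
    my $child = $node->getChildAtIndex(0);
    next unless defined $child; # e.g. <script src="..."></script> has no body
    $cx->eval($child->toString());
}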

print "$_\n" for @scanQueue;

$doc->dispose;

Step 3, Pipe the well-formed document into the parser (timer ends here)

We use HTML Tidy to convert the document into well-formed XHTML and pipe that into our parser:
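The exact command is not shown here, so the following is only a sketch of one way to do it. It assumes HTML Tidy is installed as tidy and that the parser from Step 2 is executable in the current directory; the function name scan_page and the PARSER variable are made up for illustration.

```shell
# Sketch only: clean up one HTML file with HTML Tidy and feed the result to
# the parser from Step 2. scan_page and PARSER are hypothetical names.
PARSER=${PARSER:-./buildEmailScannableList}

scan_page() {
    # -q        keep Tidy quiet
    # -asxml    convert the HTML to well-formed XML (XHTML)
    # -numeric  emit numeric rather than named entities
    # Tidy's warnings go to stderr and are discarded here.
    tidy -q -asxml -numeric "$1" 2>/dev/null | "$PARSER"
}
```

It would be run as scan_page test.html, where test.html stands for whatever file was created in Step 1. Tidy exits with a non-zero status even when it only has warnings to report, so the pipeline deliberately relies on the parser's exit status instead.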

Step X, Integrate into your text-scanning scripts with pipes or a file queue

Well, the sky is the limit! Of course, you can easily use wget to do the actual spidering of the Web site.
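As a rough sketch of how the pieces might be chained together: wget mirrors a site into a local directory, each page goes through Tidy and the parser, and a grep filter keeps anything that looks like an address. The function names, the crawl directory, the depth limit and the deliberately loose regular expression are all assumptions, not part of the original script.

```shell
# Sketch only; every name below is hypothetical.
PARSER=${PARSER:-./buildEmailScannableList}

# Keep anything on stdin that looks like an email address, one per line.
extract_emails() {
    grep -Eio '[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}' | sort -u
}

# Mirror a site two levels deep, then scan every HTML page that was fetched.
spider_site() {
    url=$1
    dir=${2:-crawl}
    # -r recurse, -l 2 limit depth, -q quiet, -P download directory,
    # -A only keep files with these suffixes
    wget -r -l 2 -q -P "$dir" -A 'html,htm' "$url"
    find "$dir" -type f \( -name '*.html' -o -name '*.htm' \) |
    while IFS= read -r page; do
        tidy -q -asxml -numeric "$page" 2>/dev/null | "$PARSER"
    done | extract_emails
}
```

A call such as spider_site http://www.example.com/ would then print one candidate address per line. The filter also works on its own at the end of any pipe, which is the simplest way to bolt it onto an existing scanning script.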

This technique can also be used to harvest other kinds of information, such as links in dynamic JavaScript menus. So yes, it has many legitimate uses.
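For the menu case, the same parser output can be filtered for URLs instead of addresses. A minimal sketch, assuming the strings passed to document.write contain absolute http or https links; extract_links and the pattern are illustrative only.

```shell
# Hypothetical filter: print each absolute http(s) URL found on stdin once.
extract_links() {
    grep -Eo "https?://[^\"'<> ]+" | sort -u
}
```

It reads from standard input, so it can sit at the end of the parser pipeline, for example ./buildEmailScannableList < page.xhtml | extract_links, where page.xhtml is a placeholder for a tidied page. Relative links would need a different pattern.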

If you know C, you might also be interested in SpiderMonkey.
