Wednesday, 24 June 2009

Spidering 1012: How to transform malformed HTML into easy-to-use XML (XHTML)

Step 1: Use HTML Tidy to transform the HTML document into XHTML:
tidy -asxhtml < bad.html > good.html
Now, Tidy sometimes fails on bad data (such as binary bytes embedded in the page)! No worries: this is where you write a small script that strips out whatever HTML Tidy chokes on, by replacing the bad data with an empty string. Regular expressions are useful when the string to replace varies. Then pipe the cleaned result into Tidy as usual!
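
Here is a rough sketch of such a cleanup script in Python. It assumes the "bad data" is stray ASCII control bytes (NUL and friends), which is only one of the things Tidy can trip over, so adjust the pattern to whatever your pages actually contain. The function names strip_control_bytes and tidy_to_xhtml are made up for this example, and the second one assumes a tidy executable is on your PATH.

```python
import re
import subprocess

# Assumption: the bad data is ASCII control bytes other than tab, LF and CR.
# Swap in a different pattern if your pages break Tidy in some other way.
BAD_BYTES = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_control_bytes(data: bytes) -> bytes:
    """Replace every byte matched by BAD_BYTES with an empty string."""
    return BAD_BYTES.sub(b"", data)


def tidy_to_xhtml(data: bytes) -> bytes:
    """Clean the input, then pipe it through `tidy -asxhtml`."""
    result = subprocess.run(
        ["tidy", "-asxhtml"],
        input=strip_control_bytes(data),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # Tidy exits with 1 for warnings only and 2 when it hit errors.
    if result.returncode >= 2:
        raise RuntimeError(result.stderr.decode("utf-8", "replace"))
    return result.stdout
```

To use it, read bad.html in binary mode, pass the bytes to tidy_to_xhtml, and write the returned bytes to good.html.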

Now we have nice clean XHTML that we can parse with an XML parser and manipulate very easily with DOM and XPath! Enjoy!
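
As a small illustration of that last step, the sketch below pulls every link out of an XHTML document using Python's standard xml.etree.ElementTree module, which supports a limited subset of XPath. The SAMPLE document and the extract_links name are invented for this example. It also assumes the XHTML contains no named entities such as &nbsp;, which this parser would reject without the DTD.

```python
import xml.etree.ElementTree as ET

# XHTML elements live in this namespace, so XPath queries need a prefix for it.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical stand-in for what Tidy might write to good.html.
SAMPLE = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Example</title></head>
<body>
<p><a href="/one">One</a> and <a href="/two">Two</a></p>
</body>
</html>"""


def extract_links(xhtml_text):
    """Return the href of every <a> element that has one, in document order."""
    root = ET.fromstring(xhtml_text)
    return [a.get("href") for a in root.findall(".//x:a[@href]", XHTML_NS)]


print(extract_links(SAMPLE))
```

The one thing that tends to surprise people is the namespace: a plain ".//a" query finds nothing in XHTML, because every element belongs to the XHTML namespace.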
