tidy -asxhtml < bad.html > good.html
Now, Tidy sometimes fails on bad data (such as stray binary bytes embedded in the page)! No worries: this is where you write your own script to strip out whatever HTML Tidy chokes on, by replacing the bad data with an empty string. Regular expressions are useful when the string to replace varies. Then pipe the cleaned output into Tidy as usual!
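Here is a minimal sketch of such a cleanup script in Python. It assumes the "bad data" is control bytes that XML 1.0 does not allow (everything below 0x20 except tab, line feed and carriage return); your pages may need a different pattern. The names strip_bad_bytes and strip_bad.py are made up for this example.

```python
import re
import sys

# Assumption: the bad data is control bytes that are not allowed in XML 1.0,
# i.e. everything below 0x20 except tab (0x09), LF (0x0A) and CR (0x0D).
# Adjust the pattern to whatever your own pages contain.
BAD_BYTES = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_bad_bytes(data: bytes) -> bytes:
    """Replace every byte matched by BAD_BYTES with an empty string."""
    return BAD_BYTES.sub(b"", data)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Read the file named on the command line as raw bytes, clean it,
    # and write the result to stdout so it can be piped into Tidy.
    with open(sys.argv[1], "rb") as handle:
        sys.stdout.buffer.write(strip_bad_bytes(handle.read()))
```

Saved as strip_bad.py, it would be used like this:

python strip_bad.py bad.html | tidy -asxhtml > good.html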
Now we have nice clean XHTML that we can parse with an XML parser and manipulate very easily with DOM and XPath! Enjoy!
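As a rough illustration of that last step, here is a sketch that pulls every link out of an XHTML document using Python's standard xml.etree.ElementTree module, which supports a limited subset of XPath. The SAMPLE document and the extract_links helper are invented for this example. Note that XHTML elements live in the XHTML namespace, so the path needs a namespace prefix. Named entities such as &nbsp; are not defined for a plain XML parser, so you may want Tidy to emit numeric entities instead.

```python
import xml.etree.ElementTree as ET

# XHTML elements live in this namespace, so paths must be namespace-qualified.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical stand-in for the contents of good.html.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Demo</title></head>
<body><p>See <a href="http://example.com/">the <b>example</b> site</a>.</p></body>
</html>"""


def extract_links(xhtml_text):
    """Return (href, link text) pairs for every <a> element that has an href."""
    root = ET.fromstring(xhtml_text)
    links = []
    for anchor in root.findall(".//x:a[@href]", XHTML_NS):
        # itertext() also collects text inside nested elements such as <b>.
        text = "".join(anchor.itertext()).strip()
        links.append((anchor.get("href"), text))
    return links


if __name__ == "__main__":
    for href, text in extract_links(SAMPLE):
        print(href, text)
```

For real pages you would read good.html instead of SAMPLE. A full XPath 1.0 engine needs a third-party library, since ElementTree only implements a subset.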