I would do it with PHP (outside a webserver), because I've written many
scraping scripts that way. It's easy to parse HTML with PHP's DOM
extension: loadHTML() handles all the broken markup for you.
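Here is a minimal sketch of that approach; the URL and the XPath query
are just placeholders, the rest is standard DOM/loadHTML usage:

  <?php
  // Fetch a page and parse it with PHP's DOM extension.
  $html = file_get_contents('http://example.com/page.html');

  $doc = new DOMDocument();
  // loadHTML() copes with invalid markup; @ silences the warnings it emits.
  @$doc->loadHTML($html);

  // Pull out whatever you need with XPath, e.g. every link on the page.
  $xpath = new DOMXPath($doc);
  foreach ($xpath->query('//a[@href]') as $a) {
      echo $a->getAttribute('href'), "\n";
  }

Save it as, say, scrape.php and run it from the command line with
'php scrape.php'.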
I wrote a 300-line PHP script that converted some old HTML
documentation into wiki markup, submitted it to the wiki, then built a
frameset page with the original HTML on one side and the new wiki page
in edit mode on the other, and opened that page in Firefox (possible
because the script was running locally).
And I have another one running permanently on my system that scrapes
that wiki's search page looking for spam; when it finds something, it
fetches the page history (to get the ID of the last revision) and
deletes that latest version, repeating this for everything listed in
the search results. It even has exponential backoff for when it finds
nothing, and a log.
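The backoff part of that loop is simple; roughly like this, with
made-up stand-ins (findSpam, deleteLatestRevision) for the actual wiki
calls:

  <?php
  // Stubs standing in for the real scraping/deletion code.
  function findSpam() { return array(); }      // would scrape the search page
  function deleteLatestRevision($page) {}      // would use the last revision ID

  $delay = 60;        // seconds between checks
  $maxDelay = 3600;

  while (true) {
      $hits = findSpam();
      if (count($hits) > 0) {
          foreach ($hits as $page) {
              deleteLatestRevision($page);
          }
          $delay = 60;                          // found something: reset
      } else {
          $delay = min($delay * 2, $maxDelay);  // found nothing: back off
      }
      file_put_contents('spam.log', date('c') . " next check in {$delay}s\n",
                        FILE_APPEND);
      sleep($delay);
  }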
I could give you the source if you want to learn from them :)
If you want to use JS, you can do it on Windows with the Windows
Script Host. Run a script by double-clicking the .js file, or with
'wscript file.js' on the command line (or 'cscript file.js' if you
want WScript.Echo output to go to the console instead of opening a
popup). Remember to get the XMLHttpRequest through an ActiveX object;
'new XMLHttpRequest()' doesn't work (well, maybe it does if you have
IE7).
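A tiny example of that, to be saved as something like fetch.js and run
with 'cscript fetch.js' (the URL is just a placeholder):

  // Get an XMLHttpRequest via ActiveX and fetch a page synchronously.
  var xhr;
  try {
      xhr = new ActiveXObject("Msxml2.XMLHTTP");
  } catch (e) {
      xhr = new ActiveXObject("Microsoft.XMLHTTP");  // older fallback
  }
  xhr.open("GET", "http://example.com/", false);     // false = synchronous
  xhr.send();
  WScript.Echo(xhr.status + "\n" + xhr.responseText.substring(0, 200));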