I would do it with PHP (outside a webserver), because I've written many
scraping scripts that way. It's easy to parse HTML with PHP's DOM
extension: loadHTML() handles all the broken markup for you.
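Here is a minimal sketch of that approach; the URL and the XPath query
are just placeholders, the rest is standard DOM/loadHTML usage:

  <?php
  // Fetch a page and parse it with PHP's DOM extension.
  $html = file_get_contents('http://example.com/page.html');

  $doc = new DOMDocument();
  // loadHTML() copes with invalid markup; @ silences the warnings it emits.
  @$doc->loadHTML($html);

  // Pull out whatever you need with XPath, e.g. every link on the page.
  $xpath = new DOMXPath($doc);
  foreach ($xpath->query('//a[@href]') as $a) {
      echo $a->getAttribute('href'), "\n";
  }

Save it as, say, scrape.php and run it from the command line with
'php scrape.php'.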
I wrote a 300-line PHP script that converted some old HTML
documentation into wiki markup, submitted it to the wiki, then built a
frameset page with the original HTML on one side and the new wiki page
in edit mode on the other, and opened that page in Firefox (possible
because the script was running locally).
And I have another one running permanently on my system that scrapes
that wiki's search page looking for spam; when it finds something, it
fetches the page history (to get the ID of the last revision) and
deletes that latest version, repeating this for everything listed in
the search results. It even has exponential backoff for when it finds
nothing, and a log.
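The backoff part of that loop is simple; roughly like this, with
made-up stand-ins (findSpam, deleteLatestRevision) for the actual wiki
calls:

  <?php
  // Stubs standing in for the real scraping/deletion code.
  function findSpam() { return array(); }      // would scrape the search page
  function deleteLatestRevision($page) {}      // would use the last revision ID

  $delay = 60;        // seconds between checks
  $maxDelay = 3600;

  while (true) {
      $hits = findSpam();
      if (count($hits) > 0) {
          foreach ($hits as $page) {
              deleteLatestRevision($page);
          }
          $delay = 60;                          // found something: reset
      } else {
          $delay = min($delay * 2, $maxDelay);  // found nothing: back off
      }
      file_put_contents('spam.log', date('c') . " next check in {$delay}s\n",
                        FILE_APPEND);
      sleep($delay);
  }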
I could give you the source if you want to learn from them :)
If you want to use JS, you can do it on Windows with the Windows
Script Host. Run a script by double-clicking the .js file, or with
'wscript file.js' on the command line (or 'cscript file.js' if you
want WScript.Echo output to go to the console instead of opening a
popup). Remember to get the XMLHttpRequest through an ActiveX object;
'new XMLHttpRequest()' doesn't work (well, maybe it does if you have
IE7).
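A tiny example of that, to be saved as something like fetch.js and run
with 'cscript fetch.js' (the URL is just a placeholder):

  // Get an XMLHttpRequest via ActiveX and fetch a page synchronously.
  var xhr;
  try {
      xhr = new ActiveXObject("Msxml2.XMLHTTP");
  } catch (e) {
      xhr = new ActiveXObject("Microsoft.XMLHTTP");  // older fallback
  }
  xhr.open("GET", "http://example.com/", false);     // false = synchronous
  xhr.send();
  WScript.Echo(xhr.status + "\n" + xhr.responseText.substring(0, 200));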