Nicolas Alvarez wrote:
> I would do it with PHP (outside a webserver), because I've written many
> scraping scripts that way. It's easy to parse HTML with PHP's DOM and
> loadHTML, which handles all the bad syntax for you.
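(For reference, the DOM/loadHTML approach he describes looks roughly like
this; the URL and the XPath query are placeholders I made up:)

<?php
// Rough sketch of scraping with PHP's DOM; URL and query are invented.
$html = file_get_contents('http://example.com/page.html');

$doc = new DOMDocument();
libxml_use_internal_errors(true);  // don't spew warnings on bad markup
$doc->loadHTML($html);             // loadHTML recovers from broken HTML
libxml_clear_errors();

// Pull out whatever you're after; here, every link on the page.
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//a[@href]') as $link) {
    echo $link->getAttribute('href'), "\n";
}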
As long as you start a new process for each page, you'll be OK. From
what I can tell, PHP never, ever deallocates memory. Try walking through
and processing a 600-megaline database table in CLI PHP, and you'll
regret it.
You could write a script that gathers the URLs (or runs wget), then
iterates over the resulting files, running one PHP process per file.
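A rough sketch of that pattern (urls.txt, pages/, and parse_page.php are
invented names):

<?php
// Fetch everything first, then hand each file to a fresh PHP process,
// so whatever memory one page chews up is released when that process
// exits.
system('wget -i urls.txt -P pages/');
foreach (glob('pages/*') as $file) {
    system('php parse_page.php ' . escapeshellarg($file));
}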
Or use Tcl, which is what I did.
--
Darren New / San Diego, CA, USA (PST)
It's not feature creep if you put it
at the end and adjust the release date.