gathering infos from web pages (Messages 1 to 10 of 29)
From: Fa3ien
Subject: gathering infos from web pages
Date: 21 Nov 2007 08:50:05
Message: <4744378d@news.povray.org>
Hi,
in our country, we have a government-operated official website
which publishes public tender opportunities.

sample URL of a resulting page:
http://www.ejustice.just.fgov.be/cgi_bul/bul_a_1.pl?DETAIL=DETAIL&caller=list&row_id=1&numero=1&rech=472&numac=2007051997&pd=2007-11-16&lg=F&pdf_file=%2Fhome%2Fmon1%2Fbul%2Fimage%2F2007%2F1116_1.pdf&trier=+order+by+numac+desc%2C+pd%3B&language=fr&choix1=ET&choix2=ET&fromtab=BUL&sql=objet+contains++%27architecture%27&objet=architecture

These people don't provide an RSS feed, or anything else that would
spare us tedious manual keyword checking every week
(hundreds of offers are published each day).

I'd like to automate the process, so I can produce some kind of
digest of the offers we're likely to be interested in. The "examine a page
and determine whether we're likely to be interested" part will be easy.  My
problem is with the first step: "automatically retrieve every page starting
from a given one".

After some observation and testing, I know how to get the "next offer" by tweaking
the URL string appropriately.  But I need to read the content of the resulting
page to determine when I have to change the date ('pd' in the query) so I can
keep incrementing the numbering ('numac' in the query).  That's where it goes bad.
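
In rough code, the loop I'm after looks something like this (untested
Ruby sketch; the parameter names come from the sample URL above, but the
"empty page" marker and the date/number logic are only guesses):

require 'net/http'
require 'uri'
require 'date'

BASE = 'http://www.ejustice.just.fgov.be/cgi_bul/bul_a_1.pl'

# Fetch one offer page. Only the two varying parameters are shown here;
# the real query string carries many more (see the sample URL).
def fetch_offer(numac, pd)
  Net::HTTP.get(URI.parse("#{BASE}?numac=#{numac}&pd=#{pd}&language=fr"))
end

numac = 2007051997              # starting offer number, from the sample URL
pd    = Date.new(2007, 11, 16)  # starting publication date ('pd')
while pd <= Date.today
  page = fetch_offer(numac, pd.to_s)
  if page.include?('Pas de document')  # guessed "no such offer" marker
    pd += 1                            # this date is exhausted: bump 'pd'
  else
    # the "are we likely interested?" test goes here
    puts numac if page.include?('architecture')
    numac += 1
  end
end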

I thought, "well, just do some JavaScript: put the content of the URL in an iframe,
read it, and act accordingly".  Done that. It doesn't work. Why?  The XMLHttpRequest
function, which is used to put the content of the iframe into a string, is prohibited
(in any browser in existence) from working with content from another domain. Ouch!
I found a GreaseMonkey script which claimed to bypass this "cross-domain
policy", but it didn't work.

So I'm still at the start of this seemingly simple project.  I'm currently thinking
of getting the pages with wget, but can I drive wget from JavaScript?  Or should
I try another language?  Or a completely different path?

Ideas?

TIA,
Fabien.



From: Invisible
Subject: Re: gathering infos from web pages
Date: 21 Nov 2007 09:08:32
Message: <47443be0$1@news.povray.org>
I would imagine you'll have endless problems trying to get round 
security issues if you try to script this from inside a web browser.

My suggestion would be to move to another programming language that has 
an HTTP library and try to do the stuff you want from there. It'll 
probably be much easier.

Obviously I recommend Haskell for this task - and, obviously, you're 
going to say no. ;-) That being the case, I'm pretty confident that Perl 
/ Python / Ruby / Tcl / any of those hackish scripting languages will 
have a library that makes this reasonably easy.



From: Vincent Le Chevalier
Subject: Re: gathering infos from web pages
Date: 21 Nov 2007 09:11:57
Message: <47443cad$1@news.povray.org>

> So I'm still at the start of this seemingly simple project.  I'm
> currently thinking of getting the pages with wget, but can I drive
> wget from JavaScript?  Or should I try another language?  Or a
> completely different path?
> 
> Ideas?
> 

If I had to do this, I think I would write a script outside a web page, 
in whatever language is able to download files from the web (I'm sure Andrew 
will come up with something in Haskell ;-) ). Then of course this script 
could generate an HTML page as its output.

Or you could probably do such things in PHP, if you want it driven 
from a web page visit.

But I'm no expert in these matters... Never done that before.

-- 
Vincent



From: Fa3ien
Subject: Re: gathering infos from web pages
Date: 21 Nov 2007 09:26:49
Message: <47444029$1@news.povray.org>

> I would imagine you'll have endless problems trying to get round 
> security issues if you try to script this from inside a web browser.
> 
> My suggestion would be to move to another programming language that has 
> a HTTP library and try to do the stuff you want from there. It'll 
> probably be much easier.
> 
> Obviously I recommend Haskell for this task - and, obviously, you're 
> going to say no. ;-) 

How did you guess?

> That being the case, I'm pretty confident that Perl 
> / Python / Ruby / Tcl / any of those hackish scripting languages will 
> have a library that makes this reasonably easy.

I'm tempted to try my hand at Ruby, for various reasons. Maybe I could
do it in Lisp... At first I rejected that idea because it would
need AutoCAD, but no, there might be some free Lisp interpreter;
I should check.

In fact, I wouldn't even need an HTTP library, if I can shell out to
wget and read the downloaded file...
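
Something like this, say (an untested sketch; the URL and the file name
are placeholders):

# Let wget do the HTTP work, then read the result back from disk.
url = 'http://www.example.com/some/offer/page'
system('wget', '-q', '-O', 'page.html', url)  # -O saves under a known name
page = File.read('page.html')
puts 'interesting!' if page.include?('architecture')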

Fabien.



From: Gilles Tran
Subject: Re: gathering infos from web pages
Date: 21 Nov 2007 09:47:32
Message: <47444504@news.povray.org>

"Fa3ien" wrote in message news:4744378d@news.povray.org...

> So I'm still at the start of this seemingly simple project.  I'm currently
> thinking of getting the pages with wget, but can I drive wget from
> JavaScript?  Or should I try another language?  Or a completely different
> path?

This could be done in PHP, something like this:

<?php
$url = 'yoururl';
$needle = 'Architecture';
// Grab the whole page as a single string...
$haystack = file_get_contents($url);
// ...and check for the keyword; strpos() returns false when absent
// (use stripos() instead if case shouldn't matter).
if (strpos($haystack, $needle) !== false) {
    echo 'found';
} else {
    echo 'not found';
}
?>

You'd need to automate the URL by passing the right keywords.

G.



From: Invisible
Subject: Re: gathering infos from web pages
Date: 21 Nov 2007 09:58:15
Message: <47444787$1@news.povray.org>
Fa3ien wrote:

>> Obviously I recommend Haskell for this task - and, obviously, you're 
>> going to say no. ;-) 
> 
> How did you guess?

Mmm, because everybody hates Haskell? ;-)

>> That being the case, I'm pretty confident that Perl / Python / Ruby / 
>> Tcl / any of those hackish scripting languages will have a library 
>> that makes this reasonably easy.
> 
> I'm tempted to try my hand at Ruby, for various reasons. Maybe I could
> do it in Lisp... At first I rejected that idea because it would
> need AutoCAD, but no, there might be some free Lisp interpreter;
> I should check.

I'm pretty sure I looked into this myself, and found that there are 
indeed free Common Lisp interpreters out there.

(And there's always emacs... bahahaha!)

> In fact, I wouldn't even need an HTTP library, if I can shell out to
> wget and read the downloaded file...

Yeah, that's true. It's probably easier that way if there isn't a strong HTTP 
library already available. wget already covers all the important 
edge cases...



From: Fa3ien
Subject: Re: gathering infos from web pages
Date: 21 Nov 2007 10:23:46
Message: <47444d82$1@news.povray.org>

> Fa3ien wrote: 
> 
>>> Obviously I recommend Haskell for this task - and, obviously, you're 
>>> going to say no. ;-) 
>>
>> How did you guess?
> 
> Mmm, because everybody hates Haskell? ;-)

Fear, not hate.  Personally, whenever you post Haskell code, I'm quite
in awe of the power of what you say it does with such concise
code.  But I'm also scared by the fact that I don't understand one bit
of how what it does relates to what the code looks like.

Since my first message (about an hour and a half ago), I've tried Ruby.  After
20 minutes of an online hands-on tutorial (http://tryruby.hobix.com/), 5 minutes
checking for "ruby http library", and 5 minutes installing a Ruby
interpreter for Windows, I was delighted to see that this line of code,
built with my thin newly acquired knowledge:

print Net::HTTP.get(URI.parse("http://www.google.be"))

produced exactly what I expected! (it puts the content
of a web page into a string)
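
(For the record: run as a standalone script, rather than in the
tutorial's console, the line presumably needs its libraries loaded
first.)

require 'net/http'  # the Net::HTTP client
require 'uri'       # URI.parse

print Net::HTTP.get(URI.parse("http://www.google.be"))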

Ruby is a gem!

Fabien.



From: Nicolas Alvarez
Subject: Re: gathering infos from web pages
Date: 21 Nov 2007 10:29:11
Message: <47444ec7$1@news.povray.org>
I would do it with PHP (outside a webserver), because I've written many 
scraping scripts that way. It's easy to parse HTML with PHP's DOM 
extension and loadHTML(), which handles all the bad syntax for you.

I wrote a 300-line PHP script that parsed some old HTML documentation 
into wiki formatting, submitted it to the wiki, then created a frameset 
page with the original HTML on one side and the new wiki page in edit 
mode on the other, and opened the page in Firefox (possible since the 
script was running locally).

And I have another running permanently on my system that scrapes that 
wiki's search page looking for spam; when it finds something, it 
gets the page history (to learn the last revision ID) and deletes the 
last version (repeatedly, for everything listed in the search results). 
It even has exponential backoff when it finds nothing, and a log.

I could give you the source if you want to learn from them :)

If you want to use JS, you can do it if you're on Windows: Windows 
Scripting Host. Run it by double-clicking the .js file, or with 'wscript 
file.js' on the command line (or 'cscript' if you want WScript.Echo to 
print to the console instead of opening a popup). Remember to get the 
XMLHttpRequest object via ActiveX; a plain 'new XMLHttpRequest' doesn't 
work (well, maybe it will if you have IE7).



From: Invisible
Subject: Re: gathering infos from web pages
Date: 21 Nov 2007 10:46:43
Message: <474452e3$1@news.povray.org>
Fa3ien wrote:

>> Mmm, because everybody hates Haskell? ;-)
> 
> Fear, not hate.

Ah, OK. I'll rephrase then: *most* people hate Haskell. The rest just 
ph33r it. ;-)

> Personally, whenever you post Haskell code, I'm quite
> in awe of the power of what you say it does with such concise
> code.  But I'm also scared by the fact that I don't understand one bit
> of how what it does relates to what the code looks like.

It seems Haskell has the power to be both completely transparent and 
entirely opaque. A bit like mathematical formulas, really!

> I was delighted to see that this line of code,
> built with my thin newly acquired knowledge:
> 
> print Net::HTTP.get(URI.parse("http://www.google.be"))
> 
> produced exactly what I expected! (it puts the content
> of a web page into a string)

Yeah. In Haskell you'd have to spend a few minutes installing GHC, a few 
more minutes downloading and compiling the 3rd party HTTP library, and 
then you'd have to write something like

   let uri = fromJust $ parseURI "http://www.google.be"
   maybePage <- httpGet uri
   let page = fromJust maybePage

(And replace those fromJust calls with some slightly larger construct 
if you actually want to do real error handling; fromJust simply crashes 
on a parse or download failure.)

> Ruby is a gem !

LOL! I bet you're not the first to think that one up...



Alternatively, if you want to feel ill, you might try to write the Haskell 
version as

   page <- (httpGet $ fromJust $ parseURI "http://www.google.be") >>= 
(return . fromJust)

Certainly I can see where the "scary" issue comes from...



From: Phil Cook
Subject: Re: gathering infos from web pages
Date: 21 Nov 2007 11:44:05
Message: <op.t15hvnhqc3xi7v@news.povray.org>
And lo on Wed, 21 Nov 2007 13:51:37 -0000, Fa3ien  
<fab### [at] yourshoesskynetbe> did spake, saying:

<snip>
> I thought, "well, just do some JavaScript: put the content of the URL in  
> an iframe, read it, and act accordingly".  Done that. It doesn't work.  
> Why?  The XMLHttpRequest function, which is used to put the content of  
> the iframe into a string, is prohibited (in any browser in existence)  
> from working with content from another domain. Ouch!

If it's any help, I know IE6 didn't have this security restriction, but  
that 'hole' may have been plugged by now.

> I found a GreaseMonkey script which claimed to bypass this  
> "cross-domain policy", but it didn't work.
>
> So I'm still at the start of this seemingly simple project.  I'm  
> currently thinking of getting the pages with wget, but can I drive wget  
> from JavaScript?  Or should I try another language?  Or a completely  
> different path?

It depends on what you've got to work with and how it's going to be applied. If  
you've got a PHP server then, as Gilles said, that's your best bet; otherwise  
you're running a 'script' directly.

-- 
Phil Cook

--
I once tried to be apathetic, but I just couldn't be bothered
http://flipc.blogspot.com


