Hi,
In our country, we have a government-operated official website which
publishes public tender opportunities.
Sample URL of a result page:
http://www.ejustice.just.fgov.be/cgi_bul/bul_a_1.pl?DETAIL=DETAIL&caller=list&row_id=1&numero=1&rech=472&numac=2007051997&pd=2007-11-16&lg=F&pdf_file=%2Fhome%2Fmon1%2Fbul%2Fimage%2F2007%2F1116_1.pdf&trier=+order+by+numac+desc%2C+pd%3B&language=fr&choix1=ET&choix2=ET&fromtab=BUL&sql=objet+contains++%27architecture%27&objet=architecture
They don't have an RSS feed available, or anything else that would spare
us tedious manual keyword checking every week (hundreds of offers are
published each day).
I'd like to automate the process, so I can produce some kind of digest
of the offers we are likely to be interested in. The "examine a page and
determine if we are likely to be interested" part will be easy. My problem
is with the first step: "automatically retrieve every page starting from a
given one".
After some observation and testing, I know how to get the "next offer" by
tweaking the URL string appropriately. But I need to read the content of the
resulting page to determine when I have to change the date ('pd' in the query)
so I can continue incrementing the numbering ('numac' in the query). That's
where it goes bad.
I thought: "well, just do some JavaScript, put the content of the URL in an
iframe, read it, and act accordingly". Done that. It doesn't work. Why? The
XMLHttpRequest function, which is used to put the content of the iframe into
a string, is prohibited in every browser in existence from working with
content from another domain. Ouch!
I found a GreaseMonkey script which claimed to allow bypassing this
"cross-domain policy", but it didn't work.
So I'm still at the start of this seemingly simple project. I'm currently
thinking of getting the pages with wget, but can I drive wget from
JavaScript? Or should I try another language? Or a completely different path?
Ideas?
TIA,
Fabien.
----------------------------------------------------------------------
I would imagine you'll have endless problems trying to get round
security issues if you try to script this from inside a web browser.
My suggestion would be to move to another programming language that has
an HTTP library and try to do the stuff you want from there. It'll
probably be much easier.
Obviously I recommend Haskell for this task - and, obviously, you're
going to say no. ;-) That being the case, I'm pretty confident that Perl
/ Python / Ruby / Tcl / any of those hackish scripting languages will
have a library that makes this reasonably easy.
----------------------------------------------------------------------
> So I'm still at the start of this seemingly simple project. I'm currently
> thinking of getting the pages with wget, but can I drive wget from
> JavaScript? Or should I try another language? Or a completely different path?
>
> Ideas?
>
If I had to do this, I think I would write a script outside a web page,
in whatever language is able to download files from the web (I'm sure Andrew
will come up with something in Haskell ;-) ). Then of course this script
could generate an HTML page as its output.
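Something like this, perhaps (a very rough Ruby sketch of just the
digest-writing step; the list of interesting offers is assumed to come
from the scraping part):

# Assumed input: [url, title] pairs collected by the scraping loop.
interesting = [['http://www.example.com/offer1', 'Offer 1']]

html = "<html><body><h1>Tender digest</h1><ul>\n"
interesting.each do |url, title|
  html << "<li><a href=\"#{url}\">#{title}</a></li>\n"
end
html << "</ul></body></html>\n"
File.open('digest.html', 'w') { |f| f.write(html) }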
Or you could probably do such things in PHP, if you want it to be
driven from a web page.
But I'm no expert in these matters... Never done that before.
--
Vincent
----------------------------------------------------------------------
> I would imagine you'll have endless problems trying to get round
> security issues if you try to script this from inside a web browser.
>
> My suggestion would be to move to another programming language that has
> an HTTP library and try to do the stuff you want from there. It'll
> probably be much easier.
>
> Obviously I recommend Haskell for this task - and, obviously, you're
> going to say no. ;-)
How did you guess?
> That being the case, I'm pretty confident that Perl
> / Python / Ruby / Tcl / any of those hackish scripting languages will
> have a library that makes this reasonably easy.
I'm tempted to try my hand at Ruby, for various reasons. Maybe I could
do it in Lisp... At first, I rejected the idea because it would
require AutoCAD, but no, there might be some free Lisp interpreter;
I should check.
In fact, I don't even need an HTTP library if I can shell out to
wget and read the downloaded file...
Fabien.
----------------------------------------------------------------------
Fa3ien wrote in message news:4744378d@news.povray.org...
> So I'm still at the start of this seemingly simple project. I'm currently
> thinking of getting the pages with wget, but can I drive wget from
> JavaScript? Or should I try another language? Or a completely different path?
This could be done in PHP, something like this:
<?php
$url      = 'yoururl';               // page to check (placeholder)
$needle   = 'Architecture';          // keyword to look for
$haystack = file_get_contents($url); // fetch the page as a string
if (strpos($haystack, $needle) !== false) {
    echo 'found';
} else {
    echo 'not found';
}
?>
You would then need to automate building the URL, passing in the right keywords.
G.
----------------------------------------------------------------------
Fa3ien wrote:
>> Obviously I recommend Haskell for this task - and, obviously, you're
>> going to say no. ;-)
>
> How did you guess?
Mmm, because everybody hates Haskell? ;-)
>> That being the case, I'm pretty confident that Perl / Python / Ruby /
>> Tcl / any of those hackish scripting languages will have a library
>> that makes this reasonably easy.
>
> I'm tempted to try my hand at Ruby, for various reasons. Maybe I could
> do it in Lisp... At first, I rejected the idea because it would
> require AutoCAD, but no, there might be some free Lisp interpreter;
> I should check.
I'm pretty sure I looked into this myself, and found that there are
indeed free Common Lisp interpreters out there.
(And there's always emacs... bahahaha!)
> In fact, I don't even need an HTTP library if I can shell out to
> wget and read the downloaded file...
Yeah, that's true. Probably easier that way if there isn't a strong HTTP
library already available. wget already covers all the important edge
cases...
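For instance, from Ruby, shelling out to wget is only a couple of lines
(a quick sketch; assumes wget is on the PATH, and the URL is just a
placeholder):

url = 'http://www.example.com/page.html'   # placeholder URL
# -q = quiet, -O = write the output to the given file
if system('wget', '-q', '-O', 'page.html', url)
  page = File.read('page.html')
  puts "#{page.length} bytes fetched"
end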
----------------------------------------------------------------------
> Fa3ien wrote:
>
>>> Obviously I recommend Haskell for this task - and, obviously, you're
>>> going to say no. ;-)
>>
>> How did you guess?
>
> Mmm, because everybody hates Haskell? ;-)
Fear, not hate. Personally, whenever you post Haskell code, I'm quite
admiring of the power of what you say it does with such concise
code. But I am also scared by the fact that I don't understand at all
how what it does relates to what the code looks like.
Since my first message (about an hour and a half ago), I've tried Ruby.
After 20 minutes with an online hands-on tutorial (http://tryruby.hobix.com/),
5 minutes searching for "ruby http library", and 5 minutes installing a Ruby
interpreter for Windows, I was delighted to see that this line of code,
built with my thin, newly acquired knowledge (plus the obligatory
require 'net/http' at the top of the script):

print Net::HTTP.get(URI.parse("http://www.google.be"))

produced exactly what I expected it to! (it puts the content
of a web page into a string)
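From here, the loop I described in my first message should be within
reach. Something like this, perhaps (completely untested; END_MARKER is
a placeholder for whatever the site really prints when an offer number
doesn't exist, which I still have to determine, and most of the query
parameters from the real URL are left out):

require 'net/http'
require 'uri'
require 'date'

BASE       = 'http://www.ejustice.just.fgov.be/cgi_bul/bul_a_1.pl'
END_MARKER = 'PLACEHOLDER'   # not the site's real wording!

date  = Date.new(2007, 11, 16)   # starting value of 'pd'
numac = 2007051997               # starting value of 'numac'

20.times do
  url  = "#{BASE}?numac=#{numac}&pd=#{date.strftime('%Y-%m-%d')}&language=fr"
  page = Net::HTTP.get(URI.parse(url))
  if page.include?(END_MARKER)
    date += 1    # numbering exhausted for this date: move to the next day
  else
    numac += 1   # same publication date: try the next offer number
    # ... examine 'page' for interesting keywords here ...
  end
end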
Ruby is a gem!
Fabien.
----------------------------------------------------------------------
I would do it with PHP (outside a webserver), because I've written many
scraping scripts that way. It's easy to parse HTML with PHP's DOM and
loadHTML(), which handles all the bad syntax for you.
I wrote a 300-line PHP script that parsed some old HTML documentation
into wiki formatting, submitted it to the wiki, then created a frameset
page with the original HTML on one side and the new wiki page in edit
mode on the other, and opened the page in Firefox (possible since the
script was running locally).
And I have another one running permanently on my system that scrapes the
search page of that wiki looking for spam; when it finds something, it
gets the page history (to learn the last revision ID) and deletes the
last version (repeatedly, for everything listed in the search results).
It even has exponential backoff when it finds nothing, and a log.
I could give you the source if you want to learn from them :)
If you want to use JS, you can do it if you are on Windows: Windows
Scripting Host. Run it by double-clicking the .js file, or with 'wscript
file.js' on the command line (or 'cscript' if you want WScript.Echo to
print to the console instead of opening a popup). Remember to create the
XMLHttpRequest via the ActiveX object; 'new XMLHttpRequest()' doesn't
work (well, maybe it does if you have IE7).
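For example, a minimal sketch ('Msxml2.XMLHTTP' is the usual ProgID;
older systems may only have 'Microsoft.XMLHTTP', and the URL is a
placeholder):

// fetch.js -- run with: cscript fetch.js
var url = "http://www.example.com/";            // placeholder URL
var xhr = new ActiveXObject("Msxml2.XMLHTTP");  // IE's XMLHttpRequest
xhr.open("GET", url, false);                    // false = synchronous
xhr.send();
WScript.Echo(xhr.responseText.length + " bytes fetched");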
----------------------------------------------------------------------
Fa3ien wrote:
>> Mmm, because everybody hates Haskell? ;-)
>
> Fear, not hate.
Ah, OK. I rephrase then: *most* people hate Haskell. The rest just ph33r
it. ;-)
> Personally, whenever you post Haskell code, I'm quite
> admiring of the power of what you say it does with such concise
> code. But I am also scared by the fact that I don't understand at all
> how what it does relates to what the code looks like.
It seems Haskell has both the power to be completely transparent, and
also entirely opaque. A bit like mathematical formulas, really!
> I was delighted to see that this line of code,
> built with my thin, newly acquired knowledge:
>
> print Net::HTTP.get(URI.parse("http://www.google.be"))
>
> produced exactly what I expected it to! (it puts the content
> of a web page into a string)
Yeah. In Haskell you'd have to spend a few minutes installing GHC, a few
more minutes downloading and compiling the 3rd party HTTP library, and
then you'd have to write something like
let uri = fromJust $ parseURI "http://www.google.be"
maybePage <- httpGet uri
let page = fromJust maybePage
(And replace those fromJust calls with some slightly larger construct
if you actually want to do real error handling.)
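For instance, with the same hypothetical httpGet returning IO (Maybe
String), real error handling might look something like:

case parseURI "http://www.google.be" of
  Nothing  -> putStrLn "malformed URI"
  Just uri -> do
    maybePage <- httpGet uri
    case maybePage of
      Nothing   -> putStrLn "request failed"
      Just page -> putStr page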
> Ruby is a gem!
LOL! I bet you're not the first to think that one up...
Alternatively, if you feel ill, you might try to write the Haskell
version as
page <- (httpGet $ fromJust $ parseURI "http://www.google.be") >>=
(return . fromJust)
Certainly I can see where the "scary" issue comes from...
----------------------------------------------------------------------
And lo on Wed, 21 Nov 2007 13:51:37 -0000, Fa3ien
<fab### [at] yourshoesskynetbe> did spake, saying:
<snip>
> I thought: "well, just do some JavaScript, put the content of the URL in
> an iframe, read it, and act accordingly". Done that. It doesn't work.
> Why? The XMLHttpRequest function, which is used to put the content of
> the iframe into a string, is prohibited in every browser in existence
> from working with content from another domain. Ouch!
If it's any help, I know IE6 didn't have this security restriction, but
that 'hole' may have been plugged by now.
> I found a GreaseMonkey script which claimed to allow bypassing this
> "cross-domain policy", but it didn't work.
>
> So I'm still at the start of this seemingly simple project. I'm
> currently thinking of getting the pages with wget, but can I drive wget
> from JavaScript? Or should I try another language? Or a completely
> different path?
It depends on what you've got to work with and how it's going to be applied.
If you've got a PHP server then, as Gilles said, that's your best bet;
otherwise you're running a 'script' directly.
--
Phil Cook
--
I once tried to be apathetic, but I just couldn't be bothered
http://flipc.blogspot.com