Automation (Message 1 to 10 of 22)
From: Invisible
Subject: Automation
Date: 15 Nov 2011 09:51:38
Message: <4ec27c7a$1@news.povray.org>
A few of you may know that I was in the IDTA Freestyle / Rock & Roll 
National Dance Finals on Saturday. (No, we didn't win anything. Are you 
surprised? Because I'm not...)

The IDTA website itself will tell you nothing unless you are an IDTA 
member (i.e., a dance teacher). However, the website belonging to the 
company that makes the score management software has results listings 
for major tournaments using their software.

That's the good news. The bad news is that this website is completely 
awful. Quite apart from the site being ugly as hell, it uses JavaScript 
hyperlinks, seemingly just to prevent tabbed browsing from working properly.

If you want to look up the results to event 23 in a specific tournament, 
that's easy enough. But if you want to look through all the events 
trying to find competitors' names that you recognise... good luck with 
that. No, there isn't a search function. No, Google does not appear to 
have indexed any of the content. (Presumably Google doesn't follow 
stupid JS links.)

That being the case, I set about writing some software to solve this 
problem.

Step 1: Work out how to download the results tables.

It appears that the results index has a hidden HTML form on it, and 
clicking on a hyperlink runs a small JavaScript function which looks up 
some data from a giant table in the script, and shuffles the data into 
the hidden fields of the hidden form. It then POSTs this form to a Perl 
CGI script, which sends back a badly-formatted HTML page containing the 
data I want.

Mercifully, it turns out the CGI script works perfectly well if you GET 
the form parameters rather than POSTing them. And you don't even need 
/all/ of the parameters, just three main ones.

Step 2: Write a program to download the HTML.

Given the information laboriously reverse-engineered above, I was able 
to quickly throw together some code that generates the URL to fetch an 
HTML page containing the results of each individual contest.
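
The URL generation really is just string concatenation. A sketch of the 
idea (the host, script path and parameter names below are invented, not 
the real ones):

-- Sketch only: host, CGI path and parameter names are made up.
resultsUrl :: String -> Int -> Int -> String
resultsUrl tournament section event =
    "http://results.example.com/cgi-bin/results.pl"
        ++ "?tournament=" ++ tournament
        ++ "&section="    ++ show section
        ++ "&event="      ++ show event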

That's the easy bit. I then quickly threw together a trivial HTTP 
client. (Note that a trivial HTTP client is significantly less trivial 
than you might imagine... I suspect mine will break if they reconfigure 
the web server sufficiently!) Now just loop over all the details I want, 
and dump them into HTML files on disk.

(It took me a while to figure out how to turn off persistent 
connections. Using EOF to find the end of the content doesn't work if 
the server leaves the connection open. No, there isn't a content-length 
header from the server.)
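
(For the curious, the guts of the thing look roughly like this. This is 
a sketch of the approach rather than my actual code: it uses the old 
network package's Network.connectTo, asks the server to close the 
connection, and reads until EOF.)

import Network (connectTo, PortID(PortNumber), withSocketsDo)
import System.IO (hPutStr, hFlush, hGetContents, hSetBuffering,
                  BufferMode(NoBuffering))

-- Bare-bones HTTP GET: request "Connection: close", then read until EOF.
-- The returned string still contains the status line and headers, which
-- have to be chopped off separately.
httpGet :: String -> String -> IO String
httpGet host path = withSocketsDo $ do
    h <- connectTo host (PortNumber 80)
    hSetBuffering h NoBuffering
    hPutStr h $ "GET " ++ path ++ " HTTP/1.1\r\n"
             ++ "Host: " ++ host ++ "\r\n"
             ++ "Connection: close\r\n\r\n"
    hFlush h
    hGetContents h   -- lazy read; ends when the server closes the socket

-- The driver is then just a loop that writes each response to a file:
-- mapM_ (\(n, p) -> httpGet host p >>= writeFile ("event" ++ show n ++ ".html")) paths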

Step 3: Write a program to extract meaningful data from the mangled HTML.

This is the "really fun" part. Essentially it takes text and chops it 
where hard-coded search strings appear. If the server admins decide to 
change their markup, the whole thing will break hopelessly.
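
The chopping itself is nothing clever. Something like this (a sketch; 
the real search strings are chunks of the actual markup, which I won't 
reproduce here):

import Data.List (isPrefixOf)

-- Split a string at the first occurrence of a marker, dropping the marker.
-- If the marker never appears, everything ends up in the first half.
chopAt :: String -> String -> (String, String)
chopAt marker = go ""
  where
    go acc rest
        | marker `isPrefixOf` rest = (reverse acc, drop (length marker) rest)
        | (c:cs) <- rest           = go (c:acc) cs
        | otherwise                = (reverse acc, "")

-- Extract whatever sits between two (made-up) markers:
between :: String -> String -> String -> String
between open close = fst . chopAt close . snd . chopAt open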

Fortunately, most of the data is in a big HTML table. Just got to grab 
the columns you want (remembering to skip the empty width=10 column 
that's only there for spacing reasons).

Unfortunately, column 3 contains multiple separate items of data, 
combined by hand into a flat string. And *because* it's formatted by 
hand, 99% of the values in this field follow a simple pattern, and a 
further 1% of them violate this pattern, screwing up the code and 
leading to lots of special-casing.

For the solo events, column 3 contains a 1- or 2-letter area code plus a 
person's name. For the pairs events, it's an area code plus two names 
separated by an ampersand [except on one single row where they are 
separated by the word "and" rather than by "&"]. And for the team 
events, it's the [arbitrary] name of the team. Yes, there is a team 
called "Mammas & Pappas". No, there isn't a reliable way to tell which 
events are team events.

Note also that 99% of the time, there is a space on either side of the 
"&". The other 1% of the time, there isn't. There is also one single row 
which reads "Kate & & John Smith". There is also one person who's name 
is apparently "/laura". Roughly 1% of the time, the person's name is 
followed by "WITHDRAWN". Also, some tables have all the rows padded with 
spaces to constant width, and other tables don't.

In short, this stuff has been keyed in by hand, inconsistently. Which 
makes trying to automatically process it rather fiddly, to say the 
least. Every time you adjust the program, it will process 99% of the 
dataset and then do something stupid with 1% of the records, or maybe 
just one single record. Spotting these mistakes isn't very easy.
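
To give a flavour of it, the pairs case comes out looking roughly like 
this. (A sketch, not the real code, and it still ignores several of the 
special cases above; splitOn comes from the split package.)

import Data.Char (isSpace)
import Data.List (isSuffixOf)
import Data.List.Split (splitOn)

-- Untangle a hand-keyed "column 3" value for a pairs event: strip any
-- trailing WITHDRAWN flag and padding, peel off the leading area code,
-- then split the remainder on the ampersand.
parsePair :: String -> (String, [String])
parsePair raw = (area, map trim names)
  where
    cleaned      = trim (dropSuffix "WITHDRAWN" (trim raw))
    (area, rest) = break isSpace cleaned        -- 1- or 2-letter area code
    names        = splitOn "&" rest             -- works with or without spaces
    trim         = dropWhile isSpace . reverse . dropWhile isSpace . reverse
    dropSuffix s xs
        | s `isSuffixOf` xs = take (length xs - length s) xs
        | otherwise         = xs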

Still, in spite of all the problems, I finally ended up writing about 
4KB of Haskell code which manages to produce mostly sane CSV output from 
this crock of junk.

Step 4: Convert the dozens of files into one giant file.

We started with one URL for each event. That became one HTML file for 
each event, and after cooking that's one CSV file for each event. Now we 
loop over all the files, gathering the event titles into one CSV file 
and the table rows into another CSV file. The latter file needs an event 
ID prepended to each table row.
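
The gathering step is just a loop over a directory full of CSV files. 
Roughly (the file layout and naming scheme here are only for 
illustration):

import Data.List (sort)
import System.Directory (getDirectoryContents)
import System.FilePath (takeBaseName, takeExtension, (</>))

-- Read every per-event CSV file, prefix each row with an event ID taken
-- from the file name, and write one combined CSV file.
gather :: FilePath -> FilePath -> IO ()
gather dir out = do
    entries <- getDirectoryContents dir
    let files = sort (filter ((== ".csv") . takeExtension) entries)
    rows <- mapM tag files
    writeFile out (unlines (concat rows))
  where
    tag f = do
        body <- readFile (dir </> f)
        let eventId = takeBaseName f            -- "event23.csv" -> "event23"
        return [ eventId ++ "," ++ row | row <- lines body ]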

Having done all that, I can import both CSV files into MS Access, 
creating two tables - one mapping event IDs to event titles, and the 
other giving the results for each event. Now I just need to go through 
and correct any final mistakes which remain.

(Somewhere in this process, several people have ended up with obscure 
characters in their names. For example, Mr John - Davis has somehow 
picked up one such character, and now the text is wrong in Access.)

Finally, after all this effort, I can *finally* search the complete 
results for all the competitions of interest by just issuing a few SQL 
statements.

Sheesh... You know there are people who do this kind of crap for a living??


From: Darren New
Subject: Re: Automation
Date: 15 Nov 2011 22:31:12
Message: <4ec32e80$1@news.povray.org>
On 11/15/2011 6:51, Invisible wrote:
> Sheesh... You know there are people who do this kind of crap for a living??

You think that's bad? Imagine the same thing for the iTunes metadata 
database, with like 6 million rows.

It got to the point where I could tell you how many people worked on each 
day's updates, and how many of them were the same person as last time.

Especially fun, since you want "Beatles", "The Beatles", and "Beatles, The" 
to be the same band, while "Alan Parsons" and "Alan Parsons Project" are two 
different bands, and "Synergy" and "Synergy" are also two different bands, 
for example.

Not even counting classical music.

-- 
Darren New, San Diego CA, USA (PST)
   People tell me I am the counter-example.


From: Kevin Wampler
Subject: Re: Automation
Date: 15 Nov 2011 22:49:23
Message: <4ec332c3$1@news.povray.org>
On 11/15/2011 7:31 PM, Darren New wrote:
> Not even counting classical music.

As near as I can tell this problem is solved by not bothering to get it 
right.


From: Invisible
Subject: Re: Automation
Date: 16 Nov 2011 06:33:48
Message: <4ec39f9c$1@news.povray.org>
On 16/11/2011 03:31 AM, Darren New wrote:
> On 11/15/2011 6:51, Invisible wrote:
>> Sheesh... You know there are people who do this kind of crap for a
>> living??
>
> You think that's bad? Imagine the same thing for the iTunes metadata
> database, with like 6 million rows.

If it's anything like CDDB, then that would be why I never use CDDB. :-P


From: Invisible
Subject: Re: Automation
Date: 16 Nov 2011 06:34:37
Message: <4ec39fcd$1@news.povray.org>
On 16/11/2011 03:49 AM, Kevin Wampler wrote:

> As near as I can tell this problem is solved by not bothering to get it
> right.

Ah yes, the Wikipedia approach.

Required XKCD quote: http://www.xkcd.com/978/


From: Invisible
Subject: Re: Automation
Date: 16 Nov 2011 08:22:13
Message: <4ec3b905$1@news.povray.org>
On 15/11/2011 02:51 PM, Invisible wrote:

> Step 2: Write a program to download the HTML.
> Step 3: Write a program to extract meaningful data from the mangled HTML.
> Step 4: Convert the dozens of files into one giant file.

Without labouring the point too much, this is one of the reasons why I 
use Haskell and not C++. It took me about 5 minutes to write a working 
HTTP client, totalling 906 bytes of source code. I shudder to think how 
much time I would have wasted trying to do such a task in C. String 
management becomes vastly easier in C++, but I still wouldn't have a 
clue how to, say, open a socket. [Is there even a portable way of doing 
that? Or would you have to ask the native OS?]

The Haskell code I wrote isn't optimised in the slightest. I could 
easily make it a lot faster, but it isn't slow enough to be worth 
bothering with. The fact that C or C++ would be faster is a complete non-issue.

Then again, this program is quite atypical of what I write. It's a 
one-off, quick and dirty hack to get me some information. All it really 
does is push data from A to B. So maybe it's not such a great example to 
hold up.

(If this was something that I wanted to "work properly", I would of 
course have used a real HTTP library, and a real HTML parser library, 
and so on. As it is, quickly throwing something together that has the 
limited functions that I need right now was easier than trying to learn 
how to use a new library.)

Of course, now somebody is going to point out that in Perl, the entire 
program would have been 3 lines. But I'm not using Perl unless somebody 
pays me. :-P


From: nemesis
Subject: Re: Automation
Date: 16 Nov 2011 09:00:00
Message: <web.4ec3c0ce1924b9409a1bcfb90@news.povray.org>
Invisible <voi### [at] devnull> wrote:
> On 15/11/2011 02:51 PM, Invisible wrote:
>
> > Step 2: Write a program to download the HTML.
> > Step 3: Write a program to extract meaningful data from the mangled HTML.
> > Step 4: Convert the dozens of files into one giant file.
>
> Of course, now somebody is going to point out that in Perl, the entire
> program would have been 3 lines. But I'm not using Perl unless somebody
> pays me. :-P

Good you at least know that! :)

Besides builtin HTTP clients being pretty much a given in any modern programming
environment (including C++ Boost if I'm not wrong), there are also command-line
tools such as wget.  No need to write a half-baked unoptimized attempt these
days. :)


From: Invisible
Subject: Re: Automation
Date: 16 Nov 2011 09:11:12
Message: <4ec3c480$1@news.povray.org>
On 16/11/2011 01:55 PM, nemesis wrote:

> Besides builtin http clients being pretty much a given in any modern programming
> environment (including C++ boost if I'm not wrong),

It appears you're wrong. Boost doesn't seem to include any HTTP 
functionality that I can see.

> there's also command-line tools such as wget.

Amusingly, Haskell actually has a libcurl binding. Guess what? Doesn't 
work on Windows. :-P

I hadn't thought about trying wget, though. (The tricky part is running 
an external OS process and then trying to figure out whether the **** it 
actually worked OK or not.) Given that all I want to do is generate 30 
URLs and have each one downloaded to a file, it probably wouldn't be too 
hard.
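
Something like this would probably do it: run wget via System.Process 
and check the exit code. (A sketch; I haven't actually tried it.)

import System.Exit (ExitCode(..))
import System.Process (readProcessWithExitCode)

-- Run "wget -O outFile url" and report success or failure based on the
-- exit code; stderr is returned so failures can at least be logged.
fetchWithWget :: String -> FilePath -> IO (Either String ())
fetchWithWget url outFile = do
    (code, _out, err) <- readProcessWithExitCode "wget" ["-O", outFile, url] ""
    return $ case code of
        ExitSuccess   -> Right ()
        ExitFailure n -> Left ("wget exited with " ++ show n ++ ": " ++ err)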

> No need to write a half-baked unoptimized attempt these days. :)

Downloading 30 files, of 2KB each, is not exactly something that 
requires "optimising".

The text processing that happens after that? I could have made it a lot 
faster. But since I only need to run it once, I didn't bother.


From: Kevin Wampler
Subject: Re: Automation
Date: 16 Nov 2011 16:49:46
Message: <4ec42ffa$1@news.povray.org>
On 11/16/2011 5:22 AM, Invisible wrote:
> Without labouring the point too much

Trust me, that ship has sailed.


From: Invisible
Subject: Re: Automation
Date: 17 Nov 2011 06:07:15
Message: <4ec4eae3@news.povray.org>
On 16/11/2011 09:49 PM, Kevin Wampler wrote:
> On 11/16/2011 5:22 AM, Invisible wrote:
>> Without labouring the point too much
>
> Trust me, that ship has sailed.

I meant more that this *specific* example perhaps isn't the best one. 
[Which is why I'm not going to go on about it too much.]

