A few of you may know that I was in the IDTA Freestyle / Rock & Roll
National Dance Finals on Saturday. (No, we didn't win anything. Are you
surprised? Because I'm not...)
The IDTA website itself will tell you nothing unless you are an IDTA
member (i.e., a dance teacher). However, the website belonging to the
company that makes the score management software has results listings
for major tournaments using their software.
That's the good news. The bad news is that this website is completely
awful. Quite apart from the site being ugly as hell, it uses JavaScript
hyperlinks, seemingly just to prevent tabbed browsing from working properly.
If you want to look up the results to event 23 in a specific tournament,
that's easy enough. But if you want to look through all the events
trying to find competitors' names that you recognise... good luck with
that. No, there isn't a search function. No, Google does not appear to
have indexed any of the content. (Presumably Google doesn't follow
stupid JS links.)
That being the case, I set about writing some software to solve this
problem.
Step 1: Work out how to download the results tables.
It appears that the results index has a hidden HTML form on it, and
clicking on a hyperlink runs a small JavaScript function which looks up
some data from a giant table in the script, and shuffles the data into
the hidden fields of the hidden form. It then POSTs this form to a Perl
CGI script, which sends back a badly-formatted HTML page containing the
data I want.
Mercifully, it turns out the CGI script works perfectly well if you GET
the form parameters rather than POSTing them. And you don't even need
/all/ of the parameters, just three main ones.
Step 2: Write a program to download the HTML.
Given the information laboriously reverse-engineered above, I was able
to quickly throw together some code that generates the URL to fetch an
HTML page containing the results of each individual contest.
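The gist of it is a one-liner, sketched here in Python. Note that the host, script name and the three parameter names below are hypothetical stand-ins, not the real reverse-engineered ones:

```python
def results_url(tournament_id, event_no,
                base="http://example.com/cgi-bin/results.pl"):
    """Build a GET URL carrying the three parameters the CGI script needs.
    Parameter names are placeholders for the real hidden-form fields."""
    return f"{base}?tournament={tournament_id}&event={event_no}&mode=results"
```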
That's the easy bit. I then quickly threw together a trivial HTTP
client. (Note that a trivial HTTP client is significantly less trivial
than you might imagine... I suspect mine will break if they reconfigure
the web server sufficiently!) Now just loop over all the details I want,
and dump them into HTML files on disk.
(It took me a while to figure out how to turn off persistent
connections. Using EOF to find the end of the content doesn't work if
the server leaves the connection open. No, there isn't a content-length
header from the server.)
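The trick is to send "Connection: close" yourself, so the server hangs up after the body and EOF becomes a reliable end-of-content marker. A minimal sketch of that kind of client in Python:

```python
import socket

def http_get(host, path, port=80):
    """Minimal hand-rolled GET.  Sending 'Connection: close' makes the
    server close the socket after the body, so reading until EOF works
    even when the server sends no Content-Length header."""
    with socket.create_connection((host, port)) as sock:
        request = (f"GET {path} HTTP/1.1\r\n"
                   f"Host: {host}\r\n"
                   "Connection: close\r\n\r\n")
        sock.sendall(request.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:              # EOF: the server closed the connection
                break
            chunks.append(data)
    response = b"".join(chunks)
    # Everything after the first blank line is the body.
    _headers, _, body = response.partition(b"\r\n\r\n")
    return body
```

This is exactly the sort of client that breaks the moment the server starts doing chunked transfer encoding, which is one reason a "trivial" HTTP client is less trivial than it looks.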
Step 3: Write a program to extract meaningful data from the mangled HTML.
This is the "really fun" part. Essentially it takes text and chops it
where hard-coded search strings appear. If the server admins decide to
change their markup, the whole thing will break hopelessly.
Fortunately, most of the data is in a big HTML table. Just got to grab
the columns you want (remembering to skip the empty width=10 column
that's only there for spacing reasons).
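My actual code is Haskell, but the chop-at-hard-coded-strings approach looks roughly like this Python sketch (the markers and the position of the spacer column are assumptions for illustration):

```python
import re

def table_rows(html):
    """Chop the page at hard-coded markers: grab everything between <tr>
    and </tr>, then the cell contents within each row.  Brittle by
    design -- if the admins change their markup, this breaks hopelessly."""
    rows = []
    for row_html in re.findall(r"<tr>(.*?)</tr>", html, re.S):
        cells = re.findall(r"<td[^>]*>(.*?)</td>", row_html, re.S)
        # Skip the empty width=10 spacer column (assumed to be cell 1 here).
        rows.append([c.strip() for i, c in enumerate(cells) if i != 1])
    return rows
```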
Unfortunately, column 3 contains multiple separate items of data,
combined by hand into a flat string. And *because* it's formatted by
hand, 99% of the values in this field follow a simple pattern, and the
remaining 1% violate it, screwing up the code and leading to lots of
special-casing.
For the solo events, column 3 contains a 1- or 2-letter area code plus a
person's name. For the pairs events, it's an area code plus two names
separated by an ampersand [except on one single row where they are
separated by the word "and" rather than by "&"]. And for the team
events, it's the [arbitrary] name of the team. Yes, there is a team
called "Mammas & Pappas". No, there isn't a reliable way to tell which
events are team events.
Note also that 99% of the time, there is a space on either side of the
"&". The other 1% of the time, there isn't. There is also one single row
which reads "Kate & & John Smith". There is also one person whose name
is apparently "/laura". Roughly 1% of the time, the person's name is
followed by "WITHDRAWN". Also, some tables have all the rows padded with
spaces to constant width, and other tables don't.
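A normaliser for that field ends up looking something like this sketch (Python rather than my Haskell, and deliberately not attempting team detection, since there is no reliable rule for it):

```python
import re

def parse_competitor(field):
    """Normalise the hand-keyed competitor field: strip padding, peel off
    a trailing WITHDRAWN flag, treat the word 'and' as '&', and drop the
    empty fragment produced by the one '& &' row.
    Returns (names, withdrawn).  Team names like 'Mammas & Pappas' will
    be wrongly split -- there is no reliable way to detect team events."""
    text = field.strip()                       # some tables pad rows with spaces
    withdrawn = text.endswith("WITHDRAWN")
    if withdrawn:
        text = text[:-len("WITHDRAWN")].strip()
    text = re.sub(r"\band\b", "&", text)       # the one row that spells it out
    names = [n.strip() for n in text.split("&") if n.strip()]
    return names, withdrawn
```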
In short, this stuff has been keyed in by hand, inconsistently. Which
makes trying to automatically process it rather fiddly, to say the
least. Every time you adjust the program, it will process 99% of the
dataset and then do something stupid with 1% of the records, or maybe
just one single record. Spotting these mistakes isn't very easy.
Still, in spite of all the problems, I finally ended up writing about
4KB of Haskell code which manages to produce mostly sane CSV output from
this crock of junk.
Step 4: Convert the dozens of files into one giant file.
We started with one URL for each event. That became one HTML file for
each event, and after cooking that's one CSV file for each event. Now we
loop over all the files, gathering the event titles into one CSV file
and the table rows into another CSV file. The latter file needs an event
ID prepended to each table row.
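The gathering step is mechanical. A sketch, assuming each per-event file is named after its event ID and carries the event title as its first row (both assumptions about the layout, for illustration):

```python
import csv
import pathlib

def merge_results(csv_dir, events_out, results_out):
    """Gather per-event CSV files (assumed named <event-id>.csv, title on
    the first row) into two combined files: one mapping event IDs to
    titles, one holding every result row prefixed with its event ID."""
    with open(events_out, "w", newline="") as ef, \
         open(results_out, "w", newline="") as rf:
        events, results = csv.writer(ef), csv.writer(rf)
        for path in sorted(pathlib.Path(csv_dir).glob("*.csv")):
            event_id = path.stem
            with open(path, newline="") as f:
                reader = csv.reader(f)
                title = next(reader)           # first row: the event title
                events.writerow([event_id] + title)
                for row in reader:
                    results.writerow([event_id] + row)
```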
Having done all that, I can import both CSV files into MS Access,
creating two tables - one mapping event IDs to event titles, and the
other giving the results for each event. Now I just need to go through
and correct any final mistakes which remain.
(Somewhere in this process, several people have ended up with obscure
characters in their names. For example, Mr John - Davis has somehow
acquired a stray character, and the text comes out wrong in Access.)
Finally, after all this effort, I can *finally* search the complete
results for all the competitions of interest by just issuing a few SQL
statements.
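The payoff query is a plain join. A small self-contained demo, using SQLite in place of Access and an assumed two-table schema with illustrative column names:

```python
import sqlite3

# Assumed schema mirroring the Access import: events(id, title) and
# results(event_id, place, name).  Names and sample data are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events  (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE results (event_id INTEGER, place INTEGER, name TEXT);
    INSERT INTO events  VALUES (23, 'Rock & Roll Pairs');
    INSERT INTO results VALUES (23, 1, 'Kate Smith'), (23, 2, 'John Davis');
""")

# The question the website couldn't answer: in which events did a given
# competitor appear, and where did they place?
rows = conn.execute("""
    SELECT e.title, r.place
    FROM results r JOIN events e ON e.id = r.event_id
    WHERE r.name LIKE '%Smith%'
""").fetchall()
```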
Sheesh... You know there are people who do this kind of crap for a living??