A few of you may know that I was in the IDTA Freestyle / Rock & Roll
National Dance Finals on Saturday. (No, we didn't win anything. Are you
surprised? Because I'm not...)
The IDTA website itself will tell you nothing unless you are an IDTA
member (i.e., a dance teacher). However, the website belonging to the
company that makes the score management software has results listings
for major tournaments using their software.
That's the good news. The bad news is that this website is completely
awful. Quite apart from the site being ugly as hell, it uses JavaScript
hyperlinks, seemingly just to prevent tabbed browsing from working properly.
If you want to look up the results to event 23 in a specific tournament,
that's easy enough. But if you want to look through all the events
trying to find competitors' names that you recognise... good luck with
that. No, there isn't a search function. No, Google does not appear to
have indexed any of the content. (Presumably Google doesn't follow
stupid JS links.)
That being the case, I set about writing some software to solve this
problem.
Step 1: Work out how to download the results tables.
It appears that the results index has a hidden HTML form on it, and
clicking on a hyperlink runs a small JavaScript function which looks up
some data from a giant table in the script, and shuffles the data into
the hidden fields of the hidden form. It then POSTs this form to a Perl
CGI script, which sends back a badly-formatted HTML page containing the
data I want.
Mercifully, it turns out the CGI script works perfectly well if you GET
the form parameters rather than POSTing them. And you don't even need
/all/ of the parameters, just three main ones.
Step 2: Write a program to download the HTML.
Given the information laboriously reverse-engineered above, I was able
to quickly throw together some code that generates the URL to fetch an
HTML page containing the results of each individual contest.
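The gist of it is a one-liner, sketched here in Python. Note that the host, script name and the three parameter names below are hypothetical stand-ins, not the real reverse-engineered ones:

```python
def results_url(tournament_id, event_no,
                base="http://example.com/cgi-bin/results.pl"):
    """Build a GET URL carrying the three parameters the CGI script needs.
    Parameter names are placeholders for the real hidden-form fields."""
    return f"{base}?tournament={tournament_id}&event={event_no}&mode=results"
```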
That's the easy bit. I then quickly threw together a trivial HTTP
client. (Note that a trivial HTTP client is significantly less trivial
than you might imagine... I suspect mine will break if they reconfigure
the web server sufficiently!) Now just loop over all the details I want,
and dump them into HTML files on disk.
(It took me a while to figure out how to turn off persistent
connections. Using EOF to find the end of the content doesn't work if
the server leaves the connection open. No, there isn't a content-length
header from the server.)
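The trick is to send "Connection: close" yourself, so the server hangs up after the body and EOF becomes a reliable end-of-content marker. A minimal sketch of that kind of client in Python:

```python
import socket

def http_get(host, path, port=80):
    """Minimal hand-rolled GET.  Sending 'Connection: close' makes the
    server close the socket after the body, so reading until EOF works
    even when the server sends no Content-Length header."""
    with socket.create_connection((host, port)) as sock:
        request = (f"GET {path} HTTP/1.1\r\n"
                   f"Host: {host}\r\n"
                   "Connection: close\r\n\r\n")
        sock.sendall(request.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:              # EOF: the server closed the connection
                break
            chunks.append(data)
    response = b"".join(chunks)
    # Everything after the first blank line is the body.
    _headers, _, body = response.partition(b"\r\n\r\n")
    return body
```

This is exactly the sort of client that breaks the moment the server starts doing chunked transfer encoding, which is one reason a "trivial" HTTP client is less trivial than it looks.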
Step 3: Write a program to extract meaningful data from the mangled HTML.
This is the "really fun" part. Essentially it takes text and chops it
where hard-coded search strings appear. If the server admins decide to
change their markup, the whole thing will break hopelessly.
Fortunately, most of the data is in a big HTML table. Just got to grab
the columns you want (remembering to skip the empty width=10 column
that's only there for spacing reasons).
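My actual code is Haskell, but the chop-at-hard-coded-strings approach looks roughly like this Python sketch (the markers and the position of the spacer column are assumptions for illustration):

```python
import re

def table_rows(html):
    """Chop the page at hard-coded markers: grab everything between <tr>
    and </tr>, then the cell contents within each row.  Brittle by
    design -- if the admins change their markup, this breaks hopelessly."""
    rows = []
    for row_html in re.findall(r"<tr>(.*?)</tr>", html, re.S):
        cells = re.findall(r"<td[^>]*>(.*?)</td>", row_html, re.S)
        # Skip the empty width=10 spacer column (assumed to be cell 1 here).
        rows.append([c.strip() for i, c in enumerate(cells) if i != 1])
    return rows
```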
Unfortunately, column 3 contains multiple separate items of data,
combined by hand into a flat string. And *because* it's formatted by
hand, 99% of the values in this field follow a simple pattern, and the
remaining 1% violate it, screwing up the code and leading to lots of
special-casing.
For the solo events, column 3 contains a 1- or 2-letter area code plus a
person's name. For the pairs events, it's an area code plus two names
separated by an ampersand [except on one single row where they are
separated by the word "and" rather than by "&"]. And for the team
events, it's the [arbitrary] name of the team. Yes, there is a team
called "Mammas & Pappas". No, there isn't a reliable way to tell which
events are team events.
Note also that 99% of the time, there is a space on either side of the
"&". The other 1% of the time, there isn't. There is also one single row
which reads "Kate & & John Smith". There is also one person whose name
is apparently "/laura". Roughly 1% of the time, the person's name is
followed by "WITHDRAWN". Also, some tables have all the rows padded with
spaces to constant width, and other tables don't.
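A normaliser for that field ends up looking something like this sketch (Python rather than my Haskell, and deliberately not attempting team detection, since there is no reliable rule for it):

```python
import re

def parse_competitor(field):
    """Normalise the hand-keyed competitor field: strip padding, peel off
    a trailing WITHDRAWN flag, treat the word 'and' as '&', and drop the
    empty fragment produced by the one '& &' row.
    Returns (names, withdrawn).  Team names like 'Mammas & Pappas' will
    be wrongly split -- there is no reliable way to detect team events."""
    text = field.strip()                       # some tables pad rows with spaces
    withdrawn = text.endswith("WITHDRAWN")
    if withdrawn:
        text = text[:-len("WITHDRAWN")].strip()
    text = re.sub(r"\band\b", "&", text)       # the one row that spells it out
    names = [n.strip() for n in text.split("&") if n.strip()]
    return names, withdrawn
```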
In short, this stuff has been keyed in by hand, inconsistently. Which
makes trying to automatically process it rather fiddly, to say the
least. Every time you adjust the program, it will process 99% of the
dataset and then do something stupid with 1% of the records, or maybe
just one single record. Spotting these mistakes isn't very easy.
Still, in spite of all the problems, I finally ended up writing about
4KB of Haskell code which manages to produce mostly sane CSV output from
this crock of junk.
Step 4: Convert the dozens of files into one giant file.
We started with one URL for each event. That became one HTML file for
each event, and after cooking that's one CSV file for each event. Now we
loop over all the files, gathering the event titles into one CSV file
and the table rows into another CSV file. The latter file needs an event
ID prepended to each table row.
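The gathering step is mechanical. A sketch, assuming each per-event file is named after its event ID and carries the event title as its first row (both assumptions about the layout, for illustration):

```python
import csv
import pathlib

def merge_results(csv_dir, events_out, results_out):
    """Gather per-event CSV files (assumed named <event-id>.csv, title on
    the first row) into two combined files: one mapping event IDs to
    titles, one holding every result row prefixed with its event ID."""
    with open(events_out, "w", newline="") as ef, \
         open(results_out, "w", newline="") as rf:
        events, results = csv.writer(ef), csv.writer(rf)
        for path in sorted(pathlib.Path(csv_dir).glob("*.csv")):
            event_id = path.stem
            with open(path, newline="") as f:
                reader = csv.reader(f)
                title = next(reader)           # first row: the event title
                events.writerow([event_id] + title)
                for row in reader:
                    results.writerow([event_id] + row)
```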
Having done all that, I can import both CSV files into MS Access,
creating two tables - one mapping event IDs to event titles, and the
other giving the results for each event. Now I just need to go through
and correct any final mistakes which remain.
(Somewhere in this process, several people have ended up with obscure
characters in their names. For example, Mr John - Davis has somehow
acquired a stray character, and the text comes out wrong in Access.)
Finally, after all this effort, I can *finally* search the complete
results for all the competitions of interest by just issuing a few SQL
statements.
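The payoff query is a plain join. A small self-contained demo, using SQLite in place of Access and an assumed two-table schema with illustrative column names:

```python
import sqlite3

# Assumed schema mirroring the Access import: events(id, title) and
# results(event_id, place, name).  Names and sample data are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events  (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE results (event_id INTEGER, place INTEGER, name TEXT);
    INSERT INTO events  VALUES (23, 'Rock & Roll Pairs');
    INSERT INTO results VALUES (23, 1, 'Kate Smith'), (23, 2, 'John Davis');
""")

# The question the website couldn't answer: in which events did a given
# competitor appear, and where did they place?
rows = conn.execute("""
    SELECT e.title, r.place
    FROM results r JOIN events e ON e.id = r.event_id
    WHERE r.name LIKE '%Smith%'
""").fetchall()
```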
Sheesh... You know there are people who do this kind of crap for a living??