A few of you may know that I was in the IDTA Freestyle / Rock & Roll
National Dance Finals on Saturday. (No, we didn't win anything. Are you
surprised? Because I'm not...)
The IDTA website itself will tell you nothing unless you are an IDTA
member (i.e., a dance teacher). However, the website belonging to the
company that makes the score management software has results listings
for major tournaments using their software.
That's the good news. The bad news is that this website is completely
awful. Quite apart from the site being ugly as hell, it uses JavaScript
hyperlinks, seemingly just to prevent tabbed browsing from working properly.
If you want to look up the results to event 23 in a specific tournament,
that's easy enough. But if you want to look through all the events
trying to find competitors' names that you recognise... good luck with
that. No, there isn't a search function. No, Google does not appear to
have indexed any of the content. (Presumably Google doesn't follow
stupid JS links.)
That being the case, I set about writing some software to solve this
problem.
Step 1: Work out how to download the results tables.
It appears that the results index has a hidden HTML form on it, and
clicking on a hyperlink runs a small JavaScript function which looks up
some data from a giant table in the script, and shuffles the data into
the hidden fields of the hidden form. It then POSTs this form to a Perl
CGI script, which sends back a badly-formatted HTML page containing the
data I want.
Mercifully, it turns out the CGI script works perfectly well if you GET
the form parameters rather than POSTing them. And you don't even need
/all/ of the parameters, just three main ones.
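To give a flavour of it, the URL generation boils down to something like
this. (The host name and parameter names below are invented; the real
CGI script uses different ones.)

    -- Hypothetical host and parameter names, purely for illustration.
    resultsURL :: String -> Int -> Int -> String
    resultsURL venue year event =
        "http://scores.example.com/cgi-bin/results.pl"
        ++ "?venue=" ++ venue
        ++ "&year="  ++ show year
        ++ "&event=" ++ show event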
Step 2: Write a program to download the HTML.
Given the information laboriously reverse-engineered above, I was able
to quickly throw together some code that generates the URL to fetch an
HTML page containing the results of each individual contest.
That's the easy bit. I then quickly threw together a trivial HTTP
client. (Note that a trivial HTTP client is significantly less trivial
than you might imagine... I suspect mine will break if they reconfigure
the web server sufficiently!) Now just loop over all the details I want,
and dump them into HTML files on disk.
(It took me a while to figure out how to turn off persistent
connections. Using EOF to find the end of the content doesn't work if
the server leaves the connection open. No, there isn't a content-length
header from the server.)
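For the curious, a hand-rolled client of this sort looks roughly like
the sketch below (using the network and bytestring packages; not my
actual code). Asking for HTTP/1.0 with "Connection: close" is what turns
off persistent connections, so reading until EOF really does find the
end of the body.

    import qualified Data.ByteString.Char8 as B
    import Network.Socket
    import Network.Socket.ByteString (recv, sendAll)

    -- Fetch one path from a host over plain HTTP and return the raw
    -- response, headers and all. HTTP/1.0 plus "Connection: close"
    -- disables persistent connections, so EOF marks the end of the content.
    fetchPage :: String -> String -> IO B.ByteString
    fetchPage host path = do
        addr:_ <- getAddrInfo (Just defaultHints { addrSocketType = Stream })
                              (Just host) (Just "80")
        sock <- socket (addrFamily addr) (addrSocketType addr) (addrProtocol addr)
        connect sock (addrAddress addr)
        sendAll sock $ B.pack $
            "GET " ++ path ++ " HTTP/1.0\r\n" ++
            "Host: " ++ host ++ "\r\n" ++
            "Connection: close\r\n\r\n"
        let readAll acc = do
                chunk <- recv sock 4096
                if B.null chunk
                    then return (B.concat (reverse acc))
                    else readAll (chunk : acc)
        response <- readAll []
        close sock
        return response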
Step 3: Write a program to extract meaningful data from the mangled HTML.
This is the "really fun" part. Essentially it takes text and chops it
where hard-coded search strings appear. If the server admins decide to
change their markup, the whole thing will break hopelessly.
Fortunately, most of the data is in a big HTML table. Just got to grab
the columns you want (remembering to skip the empty width=10 column
that's only there for spacing reasons).
Unfortunately, column 3 contains multiple separate items of data,
combined by hand into a flat string. And *because* it's formatted by
hand, 99% of the values in this field follow a simple pattern, while the
remaining 1% violate it, screwing up the code and
leading to lots of special-casing.
For the solo events, column 3 contains a 1- or 2-letter area code plus a
person's name. For the pairs events, it's an area code plus two names
separated by an ampersand [except on one single row where they are
separated by the word "and" rather than by "&"]. And for the team
events, it's the [arbitrary] name of the team. Yes, there is a team
called "Mammas & Pappas". No, there isn't a reliable way to tell which
events are team events.
Note also that 99% of the time, there is a space on either side of the
"&". The other 1% of the time, there isn't. There is also one single row
which reads "Kate & & John Smith". There is also one person who's name
is apparently "/laura". Roughly 1% of the time, the person's name is
followed by "WITHDRAWN". Also, some tables have all the rows padded with
spaces to constant width, and other tables don't.
In short, this stuff has been keyed in by hand, inconsistently. Which
makes trying to automatically process it rather fiddly, to say the
least. Every time you adjust the program, it will process 99% of the
dataset and then do something stupid with 1% of the records, or maybe
just one single record. Spotting these mistakes isn't very easy.
Still, in spite of all the problems, I finally ended up writing about
4KB of Haskell code which manages to produce mostly sane CSV output from
this crock of junk.
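To give a flavour of the chopping and cell-splitting involved, here is a
stripped-down sketch (not the actual code, and the sample cell in the
comment is invented):

    import Data.Char (isSpace)
    import Data.List (isPrefixOf)

    -- Drop everything up to and including the first occurrence of a marker
    -- string; Nothing if the marker never occurs (i.e. the markup changed).
    dropPast :: String -> String -> Maybe String
    dropPast marker s
        | null s                = Nothing
        | marker `isPrefixOf` s = Just (drop (length marker) s)
        | otherwise             = dropPast marker (tail s)

    -- Split a string on every occurrence of a separator substring.
    splitStr :: String -> String -> [String]
    splitStr sep s
        | null s             = [""]
        | sep `isPrefixOf` s = "" : splitStr sep (drop (length sep) s)
        | otherwise          = let (x:xs) = splitStr sep (tail s)
                               in (head s : x) : xs

    trim :: String -> String
    trim = dropWhile isSpace . reverse . dropWhile isSpace . reverse

    -- Split a competitor cell such as "NW Kate Smith & John Smith" into an
    -- area code plus a list of names. Copes with "&", the stray "& &" row
    -- and the lone " and " separator, but a team name like "Mammas & Pappas"
    -- still gets split incorrectly.
    parseCell :: String -> (String, [String])
    parseCell cell =
        let (area, rest) = break isSpace (trim cell)
            pieces       = concatMap (splitStr " and ") (splitStr "&" rest)
        in (area, filter (not . null) (map trim pieces))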
Step 4: Convert the dozens of files into one giant file.
We started with one URL for each event. That became one HTML file for
each event, and after cooking that's one CSV file for each event. Now we
loop over all the files, gathering the event titles into one CSV file
and the table rows into another CSV file. The latter file needs an event
ID prepended to each table row.
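Something along these lines does the gluing (a sketch only; it assumes
one <event-ID>.csv file per event, which is a made-up layout):

    import Data.List (isSuffixOf)
    import System.Directory (getDirectoryContents)
    import System.FilePath (takeBaseName, (</>))

    -- Glue every per-event CSV file in a directory into one big CSV,
    -- prepending the event ID (taken here from the file name) to each row.
    combineResults :: FilePath -> FilePath -> IO ()
    combineResults dir out = do
        files <- getDirectoryContents dir
        rows  <- mapM perFile [f | f <- files, ".csv" `isSuffixOf` f]
        writeFile out (unlines (concat rows))
      where
        perFile f = do
            txt <- readFile (dir </> f)
            let eventId = takeBaseName f
            return [eventId ++ "," ++ row | row <- lines txt, not (null row)]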
Having done all that, I can import both CSV files into MS Access,
creating two tables - one mapping event IDs to event titles, and the
other giving the results for each event. Now I just need to go through
and correct any final mistakes which remain.
(Somewhere in this process, several people have ended up with obscure
characters in their names. For example, Mr John - Davis has somehow
picked up a stray character, so the text comes out wrong in Access.)
Finally, after all this effort, I can *finally* search the complete
results for all the competitions of interest by just issuing a few SQL
statements.
Sheesh... You know there are people who do this kind of crap for a living??
On 11/15/2011 6:51, Invisible wrote:
> Sheesh... You know there are people who do this kind of crap for a living??
You think that's bad? Imagine the same thing for the iTunes metadata
database, with like 6 million rows.
It got to the point where I could tell you how many people worked on each
day's updates, and how many of them were the same person as last time.
Especially fun, since you want "Beatles", "The Beatles", and "Beatles, The"
to be the same band, while "Alan Parsons" and "Alan Parsons Project" are two
different bands, and "Synergy" and "Synergy" are also two different bands,
for example.
Not even counting classical music.
--
Darren New, San Diego CA, USA (PST)
People tell me I am the counter-example.
On 11/15/2011 7:31 PM, Darren New wrote:
> Not even counting classical music.
As near as I can tell this problem is solved by not bothering to get it
right.
On 16/11/2011 03:31 AM, Darren New wrote:
> On 11/15/2011 6:51, Invisible wrote:
>> Sheesh... You know there are people who do this kind of crap for a
>> living??
>
> You think that's bad? Imagine the same thing for the iTunes metadata
> database, with like 6 million rows.
If it's anything like CDDB, then that would be why I never use CDDB. :-P
On 16/11/2011 03:49 AM, Kevin Wampler wrote:
> As near as I can tell this problem is solved by not bothering to get it
> right.
Ah yes, the Wikipedia approach.
Required XKCD quote: http://www.xkcd.com/978/
On 15/11/2011 02:51 PM, Invisible wrote:
> Step 2: Write a program to download the HTML.
> Step 3: Write a program to extract meaningful data from the mangled HTML.
> Step 4: Convert the dozens of files into one giant file.
Without labouring the point too much, this is one of the reasons why I
use Haskell and not C++. It took me about 5 minutes to write a working
HTTP client, totalling 906 bytes of source code. I shudder to think how
much time I would have wasted trying to do such a task in C. String
management becomes vastly easier in C++, but I still wouldn't have a
clue how to, say, open a socket. [Is there even a portable way of doing
that? Or would you have to ask the native OS?]
The Haskell code I wrote isn't optimised in the slightest. I could
easily make it a lot faster, but it's not slow enough to be worth
bothering. The fact that C or C++ would be faster is a complete non-issue.
Then again, this program is quite atypical of what I write. It's a
one-off, quick and dirty hack to get me some information. All it really
does is push data from A to B. So maybe it's not such a great example to
hold up.
(If this was something that I wanted to "work properly", I would of
course have used a real HTTP library, and a real HTML parser library,
and so on. As it is, quickly throwing something together that has the
limited functions that I need right now was easier than trying to learn
how to use a new library.)
Of course, now somebody is going to point out that in Perl, the entire
program would have been 3 lines. But I'm not using Perl unless somebody
pays me. :-P
Invisible <voi### [at] devnull> wrote:
> On 15/11/2011 02:51 PM, Invisible wrote:
>
> > Step 2: Write a program to download the HTML.
> > Step 3: Write a program to extract meaningful data from the mangled HTML.
> > Step 4: Convert the dozens of files into one giant file.
>
> Of course, now somebody is going to point out that in Perl, the entire
> program would have been 3 lines. But I'm not using Perl unless somebody
> pays me. :-P
Good you at least know that! :)
Besides builtin http clients being pretty much a given in any modern programming
environment (including C++ boost if I'm not wrong), there's also command-line
tools such as wget. No need to write a half-baked unoptimized attempt these
days. :)
On 16/11/2011 01:55 PM, nemesis wrote:
> Besides builtin http clients being pretty much a given in any modern programming
> environment (including C++ boost if I'm not wrong),
It appears you're wrong. Boost doesn't seem to include any HTTP
functionality that I can see.
> there's also command-line tools such as wget.
Amusingly, Haskell actually has a libcurl binding. Guess what? Doesn't
work on Windows. :-P
I hadn't thought about trying wget though - that probably wouldn't be
too hard. (The tricky part is running an external OS process and then
trying to figure out whether the **** it actually worked OK or not.)
Given that all I want to do is generate 30 URLs and have each one
downloaded to a file, that probably wouldn't be too hard.
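Something like the following would probably do it (a sketch, assuming
wget is on the PATH and judging success purely by the exit code):

    import System.Exit (ExitCode(..))
    import System.Process (readProcessWithExitCode)

    -- Run wget to download one URL to a file, and report whether it worked,
    -- judging purely by the exit code.
    fetchWithWget :: String -> FilePath -> IO Bool
    fetchWithWget url file = do
        (code, _out, err) <- readProcessWithExitCode "wget" ["-q", "-O", file, url] ""
        case code of
            ExitSuccess   -> return True
            ExitFailure n -> do
                putStrLn ("wget failed with exit code " ++ show n ++ ": " ++ err)
                return False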
> No need to write a half-baked unoptimized attempt these days. :)
Downloading 30 files, of 2KB each, is not exactly something that
requires "optimising".
The text processing that happens after that? I could have made it a lot
faster. But since I only need to run it once, I didn't bother.
On 11/16/2011 5:22 AM, Invisible wrote:
> Without labouring the point too much
Trust me, that ship has sailed.
On 16/11/2011 09:49 PM, Kevin Wampler wrote:
> On 11/16/2011 5:22 AM, Invisible wrote:
>> Without labouring the point too much
>
> Trust me, that ship has sailed.
I meant more that this *specific* example perhaps isn't the best one.
[Which is why I'm not going to go on about it too much.]