I've almost closed bug #106. I've done all I can do for now, until I
can figure out how to attribute actions to senators. Until then, I
should comment out the 'actions' tab tomorrow with a note telling
whoever tries this next what I've already tried.
There is still some work to do in this code, particularly fixing the
hacky Assembly scraping I wrote last year, which has also broken.
However, considering that the Assembly is not currently one of the bits
we are trying to expose (and it doesn't have a nice public API like the
Senate), that probably won't get done for a little while.
When I left yesterday, I had exposed the senator's social page, but none
of the other tabs were showing up. The problem turned out to be that the
image representing the tab was corrupted and the text beneath it was
white, making the tab look invisible. Once I got a new image for the
tab, they all suddenly appeared, though only the bills tab had any
information.
The next thing to fix was the scraping of the committees, whose pages
had also changed subtly in the past few months. Fixing this was much
less annoying than building them the first time, and I was able to
remove a lot of old shims in the code from when I was first being
introduced to BeautifulSoup. Some of my later changes made this
particularly easy, and since BeautifulSoup is so powerful, I was able to
restore access to the data with relative ease. As a bonus, as soon as I
had committees back up, the tabs for votes and meetings came with it for
free. Suddenly, I was almost there!
The final problem I encountered today is not a new one, but one we
struggled with last year, and one I now feel confident I have actually
fixed and understood. Once I got all the data pulling again, a few of
the pages would crash the server with UnicodeDecodeErrors.
As a warning, some heavy Python is about to come down.
UnicodeDecodeError is an error which (generally) happens when attempting
to decode a plain string into a Unicode object. Decoding is usually a
great thing to do, as Python's plain strings are byte strings whose
default encoding is the relatively restrictive ASCII, which has no
characters for any of the more exciting things like accents and
non-Latin symbols. Unicode has no such restrictions, and indeed has code
points for many, many more symbols, at the cost of a few more bits of
storage per character.
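Here's a minimal Python 2 sketch of that failure; the name is made up,
and I'm only illustrating the default-codec behaviour, not our actual
data:

    # Python 2: a bare string literal is a byte string, and decoding
    # defaults to the ASCII codec.
    name = 'Jos\xc3\xa9'                  # UTF-8 bytes for an accented name

    print repr(name.decode('utf-8'))      # explicit codec works: u'Jos\xe9'

    try:
        unicode(name)                     # implicit ASCII codec chokes on 0xc3
    except UnicodeDecodeError as e:
        print e                           # 'ascii' codec can't decode byte 0xc3...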
So why were we getting this error? The relevant line of code was 's =
unicode(s)', and it lived inside WebOb's code, not something I was going
to be able to modify successfully. Still, even this shouldn't be a
problem: the purpose of unicode() is to turn strings into Unicode
strings.
Except I didn't have a plain string; I already had a Unicode string.
Even that shouldn't be a problem, except that unicode() tries to
interpret its input as a plain string and then decode it into a Unicode
string. And while Unicode strings can easily be represented as plain
strings, unicode()'s default is to decode those strings as ASCII, and
mine had accents in them. These strings represented the names of the
senators, so I had to make sure they came out right.
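Concretely (another Python 2 sketch with a made-up name, not our real
data): once the name contains an accent, the default ASCII codec fails
in both directions.

    # Python 2 sketch: an accented Unicode name can't pass through the
    # default ASCII codec in either direction.
    name = u'Jos\xe9'

    try:
        name.encode('ascii')              # Unicode -> bytes fails
    except UnicodeEncodeError as e:
        print e

    try:
        unicode(name.encode('utf-8'))     # bytes -> Unicode via unicode() fails
    except UnicodeDecodeError as e:
        print e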
In order to solve this, I had to reversibly represent these names as a
sequence of ASCII characters.
There are a few ways to replace out-of-range characters when converting
strings into more limited encodings, and I had two useful ones to choose
from. The obvious choice was to turn the characters into XML character
entities, but this quickly turned out to be insufficient. While &#233;
correctly showed up as é on the page, this string is used to represent
the name everywhere, including in the internal URL representing the
page. There, the ampersand was quickly stripped out as a broken argument
to the URL, leading to a page for a nonexistent senator.
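For reference, Python's built-in xmlcharrefreplace error handler does
this kind of substitution; this is just a sketch with a made-up name,
and my actual code may have differed in the details:

    # Python 2 sketch of the XML-entity approach.
    name = u'Jos\xe9'

    display = name.encode('ascii', 'xmlcharrefreplace')
    print display                         # Jos&#233; -- renders correctly in HTML

    # ...but the same string also ends up in the URL, where the entity breaks:
    print '/senators/' + display          # /senators/Jos&#233;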
Looking through the code, there were three distinct uses for the name
string. The first, which had started all this, was as an ASCII key to a
dictionary, which needed to be authoritative but not necessarily
accurate. In other words, I needed it to be the same everywhere, but it
didn't necessarily need to be the correct spelling of the senator's
name. The second was the name on the generated web page, which needed to
be as accurate as possible, since it is going to be viewed publicly. The
third, and the current stickler, was the name in the URL. Again, this
had to be authoritative but not necessarily accurate. It also, however,
had to contain only web-safe characters, which &, #, and ; are not.
I mulled this over for a while, thinking up more and more elaborate
schemes for intercepting the names before they reached critical areas,
but none of them was terribly good coding practice. After far too much
thinking, I realized the obvious answer: have separate internal and
external names. The internal name still relies on the senator's name,
which is still a questionable practice given the multiple spellings of
names that occasionally pop up (but mostly because I remember this post,
which is something you should always keep in mind when programming
around names). The display name, on the other hand, has none of those
restrictions on characters (though it still needs to be ASCII to display
properly), and by using XML entities we can produce any character we
want without problems.
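Here's a minimal sketch of the split. The helper names and the senator
are made up, and the accent-stripping is just one way to build an ASCII
key; I'm not claiming this is exactly the project's code:

    # Python 2 sketch of the internal/external name split.
    import re
    import unicodedata

    def internal_name(name):
        """Authoritative but not necessarily accurate: ASCII-only, URL-safe."""
        # Decompose accented characters and drop the combining marks...
        ascii_name = unicodedata.normalize('NFKD', name).encode('ascii', 'ignore')
        # ...then keep only web-safe characters.
        return re.sub(r'[^A-Za-z0-9_-]+', '-', ascii_name).strip('-').lower()

    def display_name(name):
        """As accurate as possible for the page: non-ASCII becomes XML entities."""
        return name.encode('ascii', 'xmlcharrefreplace')

    name = u'Jos\xe9 Example'
    print internal_name(name)             # jose-example (dictionary key and URL)
    print display_name(name)              # Jos&#233; Example (rendered with the accent)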
This was a long path to take to get back to where we were, but I think I
now really understand Python's Unicode handling in a way I never grasped
before. This should definitely help in the future, as Unicode is a very
important part of writing portable applications, and that's something I
want to do.