Friday, February 17, 2006

Python Cookbook

This may seem silly, but I really have to sing the praises of the Python Cookbook, 2nd Edition. Everybody else has known about it forever, but I just got my copy a month ago.

I am blown away. I can't believe how good this book is, far beyond any other programming book I've ever known. I'm now completely embarrassed about the quality of the code I wrote without it, and tempted to stay up all night refactoring everything. In fact, I've already largely rewritten sqlWrap.py based on what I've learned.

I'm afraid I dawdled about buying my copy because I never found the recipes at ActiveState all that compelling - occasionally nice, but usually nothing to jump and scream about. I figured the Cookbook would just be a bunch of them, bound together. Not so - the careful selection and excellent discussion make it amazing. It's like having a master programmer at your elbow to guide you.

Friday, February 10, 2006

The Geek Event Aggregator is ready!


The Geek Event Aggregator
is more-or-less ready for prime time! It now collects many more events. Better yet, it is very easy to feed more event sources, so it is set for even more growth. In other words,

PLEASE FEED THE AGGREGATOR NEW SITE SUGGESTIONS!

Some design notes:

After fooling around with complex regular expressions, Beautiful Soup, etc., I found a quick-and-dirty way that works better. The Aggregator downloads a page's HTML, replaces all tags with carriage returns, breaks the remaining text into lines, and checks those lines for ones that appear to contain recognizable future dates. The Aggregator assumes that is an upcoming event date. (There are many reasons dates get put on websites, but future dates almost always refer to meetings or events.)

Wow, human beings have many, many, many ways to write dates. Fortunately, python's dateutil module can recognize most of them - I mainly have to modify that to avoid false hits (like interpreting '.' as 'today', or '2006' alone as 'Jan 1, 2006').

Some sites already aggregate events from several groups and places together; for those, the Aggregator uses a slightly different algorithm. It finds the dates as above, then finds the location and event name by their line-number position relative to the date. (A human being (me) needs to provide relative line-number positions for those values in advance; for instance, one site may always list location on the line immediately after the event date, and that fact is recorded in advance.)

For the multi-event sites, the Aggregator has a decent but kludgey algorithm to parse city and region, despite the great variety of ways to write a location. Part of that relies on a list of recognized city names. It can be used for the single-event sites, too; if a recognizable city name is in the site's Title, or in the text in the form "Blah Blah City Blah Blahware Group", the Aggregator can find it. But if your event is in Athens, GA, the Aggregator thinks it's in Greece.

HTML DB makes it very convenient to build a web interface to the data. Oracle is also very gracious to host a free sandbox for HTML DB projects (which is where the Aggregator lives right now.) Unfortunately, as far as I can see, HTML DB doesn't support RESTful interfaces, or serving up pages as XML. That's a pity, because this cries out to be a REST web service. Maybe eventually I'll buy/find a place to host the web app in TurboGears or something.

For now, I've given up on feeding upcoming.org. Upcoming mandates an actual street address, which is just too hard to find automatically. I'd still like to pull from it, although I'll have to check their legal requirements, and - dare I say it? - I don't know if it will really have many relevant events I don't already have.

I'm sorry the events are so U.S./Canada-centric. It only has a handful of events from elsewhere, and it doesn't break down regions within other countries. (What's wrong with going from St. Petersburg to Novosibirsk for a meeting, anyway? Isn't that what the Trans-Siberian Railway is for?) You can help fix this by suggesting new sites to scan, and volunteering to introduce regional granularity for other countries.

Actually, because I've been the only one to feed the Aggregator so far, the events are Ohio-centric. You folks in benighted backwaters like California and New York are just going to have to feed it your own favorite sites if you want to fix that.

Some of the many things that produce misses and false hits:
  • Dates without years. Somebody puts on the website, "Our next meeting is Nov. 19." The Aggregator can't tell that they haven't updated the site since 2004.
  • Years must be on the same line as the rest of the date. If you say,
    2005 events:
    Feb. 14
    Aug. 12
    ... the Aggregator doesn't see the "2005", and believes there are events on Feb. 14 and Aug 12 of this year.
  • Frames. Well, you can't blame it; frames mess up everybody. But if you can dig into the HTML source and puzzle out the URL of the frame with the data, then that can be read. That's what I did for the OKCOUG webpage, for example.

Thursday, February 02, 2006

Geek Event Aggregator's future

The strangest thing happened to me this morning. The clouds split open, and a beam of light shone down onto me. (Since I was sitting in my cubicle, this in itself was odd enough.) A Voice spoke from Heaven, and said,

"Catherine, remember that somewhat cheesy Geek Event Aggregator you put on HTML DB a while ago?"

"Yes, Lord?" I said. (That seemed like the obvious response. That the Almighty would speak with hyperlinks didn't seem too surprising.)

"Well, I have heard the cries of my people. I want you to rewrite your aggregator to both consume and provide events in iCalendar format. Also, interface it with upcoming.org - both to download and to upload. Behold, Python libraries for upcoming.org have already been written for you. And I'm thinking that Beautiful Soup might help you pluck event descriptions from the God-awful HTML jungles you find them in."

"It sounds wonderful, Lord," I said. "But when am I supposed to do this? I mean, you did just drop custody of our Godson into our lap. Time is not exactly abundant right now."

There was a long pause, then the Voice said, "Let me get back to you on that."