Friday, February 10, 2006

The Geek Event Aggregator is ready!

The Geek Event Aggregator is more-or-less ready for prime time! It now collects many more events. Better yet, it is very easy to feed it more event sources, so it is set for even more growth. In other words,


Some design notes:

After fooling around with complex regular expressions, Beautiful Soup, etc., I found a quick-and-dirty way that works better. The Aggregator downloads a page's HTML, replaces all tags with carriage returns, breaks the remaining text into lines, and checks those lines for ones that appear to contain recognizable future dates. The Aggregator assumes that any such date is an upcoming event date. (There are many reasons dates get put on websites, but future dates almost always refer to meetings or events.)
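A minimal sketch of that tag-stripping scan, not the Aggregator's actual code. It is stdlib-only, so it swaps in datetime.strptime with a deliberately short, made-up list of formats where the real thing leans on dateutil; the names text_lines, parse_date and future_dates are hypothetical.

```python
import re
from datetime import date, datetime

TAG = re.compile(r"<[^>]+>")
# Assumption: a tiny sample of formats; real pages use many more.
FORMATS = ("%B %d, %Y", "%b %d, %Y", "%b. %d, %Y", "%m/%d/%Y")
# Substrings that might be a date, e.g. "March 14, 2006" or "3/14/2006".
CANDIDATE = re.compile(r"[A-Z][a-z]+\.? \d{1,2}, \d{4}|\d{1,2}/\d{1,2}/\d{4}")


def text_lines(html):
    """Replace every tag with a line break and return the non-empty lines."""
    return [ln.strip() for ln in TAG.sub("\n", html).splitlines() if ln.strip()]


def parse_date(text):
    """Try each known format in turn; return a date, or None if none fits."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


def future_dates(html, today):
    """Yield (line_number, date, line) for each line holding a date after today."""
    for i, line in enumerate(text_lines(html)):
        match = CANDIDATE.search(line)
        if not match:
            continue
        found = parse_date(match.group(0))
        if found and found > today:
            yield i, found, line
```

Passing today in explicitly, rather than calling date.today() inside the function, makes the "is it in the future?" check easy to try against saved pages.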

Wow, human beings have many, many, many ways to write dates. Fortunately, Python's dateutil module can recognize most of them - I mainly have to modify it to avoid false hits (like interpreting '.' as 'today', or '2006' alone as 'Jan 1, 2006').
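One way to dodge false hits like those is to pre-screen each line before handing it to a lenient parser at all. This is a different technique from modifying dateutil itself, and only a rough sketch: the function name worth_parsing and the rule "a month name plus a day number, or a numeric date" are assumptions for illustration.

```python
import re

# Month names and common abbreviations, matched as whole words only.
MONTH = re.compile(
    r"\b(jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|june?|july?|aug(ust)?"
    r"|sep(t(ember)?)?|oct(ober)?|nov(ember)?|dec(ember)?)\b",
    re.IGNORECASE,
)
# Numeric dates such as 3/14/2006, 3-14-06 or 2006-03-14.
NUMERIC = re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b|\b\d{4}-\d{1,2}-\d{1,2}\b")
# A one- or two-digit day, optionally with an ordinal suffix (14, 14th).
DAY = re.compile(r"\b\d{1,2}(st|nd|rd|th)?\b", re.IGNORECASE)


def worth_parsing(text):
    """Return True only if text has a numeric date, or a month name plus a day."""
    if NUMERIC.search(text):
        return True
    return bool(MONTH.search(text) and DAY.search(text))
```

With this screen, a lone '.' or a bare '2006' never reaches the parser. It is still only a filter: a line like "version 2 ships in March" passes it, so the parser's result needs its own sanity check.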

Some sites already aggregate events from several groups and places together; for those, the Aggregator uses a slightly different algorithm. It finds the dates as above, then finds the location and event name by their line-number position relative to the date. (A human being (me) needs to provide relative line-number positions for those values in advance; for instance, one site may always list location on the line immediately after the event date, and that fact is recorded in advance.)
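The relative-position idea can be sketched as follows. The function extract_event, the SITE_OFFSETS table and the sample lines are all made up for illustration; the only thing taken from the description above is that fields are found by a hand-recorded line offset from the date.

```python
def extract_event(lines, date_index, offsets):
    """Pick fields by their position relative to the line holding the date.

    offsets maps a field name to a relative line offset, so {"location": 1}
    means "the line immediately after the date". An offset that falls outside
    the page yields None for that field.
    """
    event = {"date_line": lines[date_index]}
    for field, offset in offsets.items():
        i = date_index + offset
        event[field] = lines[i] if 0 <= i < len(lines) else None
    return event


# Hypothetical per-site configuration, recorded by hand in advance.
SITE_OFFSETS = {"name": -1, "location": 1}

# Made-up sample of a multi-event page after tags have been stripped.
lines = [
    "Example Python Meetup",
    "March 14, 2006",
    "Columbus, OH",
    "Example Database Night",
    "April 2, 2006",
]
first = extract_event(lines, 1, SITE_OFFSETS)
```

Here first["name"] is the line before the date and first["location"] is the line after it. The weakness is plain: if the site changes its layout, the recorded offsets silently pick up the wrong lines.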

For the multi-event sites, the Aggregator has a decent but kludgey algorithm to parse city and region, despite the great variety of ways to write a location. Part of that relies on a list of recognized city names. It can be used for the single-event sites, too; if a recognizable city name is in the site's Title, or in the text in the form "Blah Blah City Blah Blahware Group", the Aggregator can find it. But if your event is in Athens, GA, the Aggregator thinks it's in Greece.
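A toy version of the city-list lookup, including the Athens problem. The table KNOWN_CITIES and the function find_city are hypothetical, and this sketch only handles single-word city names; it is not the Aggregator's actual location parser.

```python
import re

# Hypothetical lookup table; a real list would be far longer.
KNOWN_CITIES = {
    "columbus": ("Columbus", "OH", "US"),
    "dayton": ("Dayton", "OH", "US"),
    "toronto": ("Toronto", "ON", "CA"),
    "athens": ("Athens", None, "GR"),
}


def find_city(text):
    """Return (city, region, country) for the first known city named in text."""
    for word in re.findall(r"[A-Za-z]+", text):
        hit = KNOWN_CITIES.get(word.lower())
        if hit:
            return hit
    return None
```

Because each name maps to exactly one place, find_city("Meeting in Athens, GA") returns the Greek entry and ignores the "GA" sitting right next to it.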

HTML DB makes it very convenient to build a web interface to the data. Oracle is also gracious enough to host a free sandbox for HTML DB projects (which is where the Aggregator lives right now). Unfortunately, as far as I can see, HTML DB doesn't support RESTful interfaces, or serving up pages as XML. That's a pity, because this cries out to be a REST web service. Maybe eventually I'll buy/find a place to host the web app in TurboGears or something.

For now, I've given up on feeding Upcoming, which mandates an actual street address; that is just too hard to find automatically. I'd still like to pull from it, although I'll have to check their legal requirements, and - dare I say it? - I don't know if it will really have many relevant events I don't already have.

I'm sorry the events are so U.S./Canada-centric. The Aggregator has only a handful of events from elsewhere, and it doesn't break down regions within other countries. (What's wrong with going from St. Petersburg to Novosibirsk for a meeting, anyway? Isn't that what the Trans-Siberian Railway is for?) You can help fix this by suggesting new sites to scan, and volunteering to introduce regional granularity for other countries.

Actually, because I've been the only one to feed the Aggregator so far, the events are Ohio-centric. You folks in benighted backwaters like California and New York are just going to have to feed it your own favorite sites if you want to fix that.

Some of the many things that produce misses and false hits:
  • Dates without years. Somebody puts on the website, "Our next meeting is Nov. 19." The Aggregator can't tell that they haven't updated the site since 2004.
  • Years must be on the same line as the rest of the date. If you say,
    2005 events:
    Feb. 14
    Aug. 12
    ... the Aggregator doesn't see the "2005", and believes there are events on Feb. 14 and Aug. 12 of this year.
  • Frames. Well, you can't blame it; frames mess up everybody. But if you can dig into the HTML source and puzzle out the URL of the frame with the data, then that can be read. That's what I did for the OKCOUG webpage, for example.


mso said...

Congratulations, you've done fine work!
Could you tell some details about the interaction between python and HTML DB?

Catherine said...

You touch the most painful part of the work! (And the reason the data hasn't been refreshed this week - I have better data ready, I'm just having trouble getting it uploaded.)

Once the records are inside an Oracle database, publishing them through HTML DB is very easy. And, if this were running on my own machine, getting them into the database would be easy, too; I could issue INSERT statements from within the Python code by using cx_Oracle, or generate an .SQL file of INSERT statements and run that.
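The "generate an .SQL file of INSERT statements" route might look roughly like the sketch below. The table name, column names and helper functions are hypothetical, and every value is written as a quoted string for simplicity, which a real schema with DATE columns would need to handle differently.

```python
def sql_literal(value):
    """Quote a value as a SQL string literal, doubling embedded single quotes."""
    if value is None:
        return "NULL"
    return "'" + str(value).replace("'", "''") + "'"


def insert_statements(table, rows):
    """Yield one INSERT statement per row; each row is a dict of column -> value."""
    for row in rows:
        columns = ", ".join(row)
        values = ", ".join(sql_literal(v) for v in row.values())
        yield f"INSERT INTO {table} ({columns}) VALUES ({values});"


# Made-up sample row; the table and column names are assumptions.
rows = [{"name": "O'Reilly Night", "city": "Dayton", "region": None}]
script = "\n".join(insert_statements("events", rows))
```

When issuing the inserts directly through a database driver instead, bind variables are generally the safer choice than building literals by hand like this.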

Unfortunately, in this hosted-for-free configuration, Oracle doesn't make any direct access to the back-end database available. Data can only be inserted into it through the HTML DB web interface. I have to use that - manually - to import an .xml file with the row data. (And the XML needs to be in a very specific format.)

Generating the .xml file from Python is no problem, but if it is larger than some unspecified size, the import to HTML DB fails without explanation (the import routes to a "404 not found" page). That's totally undocumented, and it's taken me a lot of frustration to get that figured out. So now I need to have the Python code generate a number of XML files, and manually import them into HTML DB one at a time. I'd do this by a script, too, if only HTML DB provided a RESTful interface.
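Splitting the output into several small files can be sketched like this. The ROWSET/ROW element names, the batch size and the file-naming scheme are assumptions for illustration only; the exact format HTML DB expects is not spelled out here, and the size limit is unknown, so the batch size would have to be found by trial.

```python
import xml.etree.ElementTree as ET


def chunks(rows, size):
    """Yield successive slices of rows, each holding at most size items."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def rows_to_xml(rows):
    """Serialize dict rows as <ROWSET><ROW><COLUMN>value</COLUMN>...</ROW></ROWSET>."""
    root = ET.Element("ROWSET")
    for row in rows:
        row_el = ET.SubElement(root, "ROW")
        for column, value in row.items():
            ET.SubElement(row_el, column).text = str(value)
    return ET.tostring(root, encoding="unicode")


def write_batches(rows, size, prefix):
    """Write rows to prefix_001.xml, prefix_002.xml, ... and return the paths."""
    paths = []
    for n, batch in enumerate(chunks(rows, size), start=1):
        path = f"{prefix}_{n:03d}.xml"
        with open(path, "w", encoding="utf-8") as f:
            f.write(rows_to_xml(batch))
        paths.append(path)
    return paths
```

Numbering the files with zero padding keeps them in order in a file listing, which helps when each one still has to be imported by hand.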

Probably, it's time I rented my own slice of a server somewhere.

dan said...

Event: Catherine being a total, terrific geek.
Date: Sometime like 3 decades ago to a few decades from now.
Place: Varies


This really is pretty cool. I've been wishing that I could get something like it for research talks here at UCD.

Sarah W said...

Good idea! I like it!

blkdykegoddess said...

excellent resource, thanks!!

me said...

Hi. We've spoken a couple of times at this PyCon and at least encountered each other at previous ones. I'd like to suggest the Yorktown High School Linux Users Group (YHSLUG) calendar, found by searching Google Calendar. The XML feed is here.
YHSLUG is located in Arlington, VA.