The Geek Event Aggregator is more-or-less ready for prime time! It now collects many more events. Better yet, it is very easy to feed more event sources, so it is set for even more growth. In other words,
PLEASE FEED THE AGGREGATOR NEW SITE SUGGESTIONS!
Some design notes:
After fooling around with complex regular expressions, Beautiful Soup, etc., I found a quick-and-dirty way that works better. The Aggregator downloads a page's HTML, replaces all tags with carriage returns, breaks the remaining text into lines, and checks those lines for ones that appear to contain recognizable future dates. The Aggregator assumes that is an upcoming event date. (There are many reasons dates get put on websites, but future dates almost always refer to meetings or events.)
Wow, human beings have many, many, many ways to write dates. Fortunately, python's dateutil module can recognize most of them - I mainly have to modify that to avoid false hits (like interpreting '.' as 'today', or '2006' alone as 'Jan 1, 2006').
Some sites already aggregate events from several groups and places together; for those, the Aggregator uses a slightly different algorithm. It finds the dates as above, then finds the location and event name by their line-number position relative to the date. (A human being (me) needs to provide relative line-number positions for those values in advance; for instance, one site may always list location on the line immediately after the event date, and that fact is recorded in advance.)
For the multi-event sites, the Aggregator has a decent but kludgey algorithm to parse city and region, despite the great variety of ways to write a location. Part of that relies on a list of recognized city names. It can be used for the single-event sites, too; if a recognizable city name is in the site's Title, or in the text in the form "Blah Blah City Blah Blahware Group", the Aggregator can find it. But if your event is in Athens, GA, the Aggregator thinks it's in Greece.
HTML DB makes it very convenient to build a web interface to the data. Oracle is also very gracious to host a free sandbox for HTML DB projects (which is where the Aggregator lives right now.) Unfortunately, as far as I can see, HTML DB doesn't support RESTful interfaces, or serving up pages as XML. That's a pity, because this cries out to be a REST web service. Maybe eventually I'll buy/find a place to host the web app in TurboGears or something.
For now, I've given up on feeding upcoming.org. Upcoming mandates an actual street address, which is just too hard to find automatically. I'd still like to pull from it, although I'll have to check their legal requirements, and - dare I say it? - I don't know if it will really have many relevant events I don't already have.
I'm sorry the events are so U.S./Canada-centric. It only has a handful of events from elsewhere, and it doesn't break down regions within other countries. (What's wrong with going from St. Petersburg to Novosibirsk for a meeting, anyway? Isn't that what the Trans-Siberian Railway is for?) You can help fix this by suggesting new sites to scan, and volunteering to introduce regional granularity for other countries.
Actually, because I've been the only one to feed the Aggregator so far, the events are Ohio-centric. You folks in benighted backwaters like California and New York are just going to have to feed it your own favorite sites if you want to fix that.
Some of the many things that produce misses and false hits:
- Dates without years. Somebody puts on the website, "Our next meeting is Nov. 19." The Aggregator can't tell that they haven't updated the site since 2004.
- Years must be on the same line as the rest of the date. If you say,
... the Aggregator doesn't see the "2005", and believes there are events on Feb. 14 and Aug 12 of this year.
- Frames. Well, you can't blame it; frames mess up everybody. But if you can dig into the HTML source and puzzle out the URL of the frame with the data, then that can be read. That's what I did for the OKCOUG webpage, for example.