Tuesday, September 02, 2008

BigTable blues

This was supposed to be the blog entry where I would announce the Geek Event Aggregator's successful port to Google App Engine.

(sigh)

I've read an awful lot of buzz about how GAE's BigTable is the Next Big Thing in data, makes RDBMS obsolete, etc. Maybe I'm just doing it wrong, but right now I am utterly unimpressed.

The Geek Event Aggregator needs to search its database of 5000 or so events for events whose longitude and latitude are close enough to the user to be of interest. Does that sound so impossible?

I couldn't do it in GAE. First, "Inequality Filters Are Allowed On One Property Only" - so I can filter for longitude or latitude, but not both. I had to filter only for longitude, pull all resulting records into the application, and finish boiling the ocean in my app. It was slow, in the local application environment, but I hoped it would run faster once uploaded to the actual GAE production servers.
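A minimal plain-Python sketch of that workaround, not the aggregator's actual code. The `Event` tuple, the `nearby` function, and the sample data are all hypothetical. The list sorted by longitude stands in for the datastore index on the one property that can take an inequality filter.

```python
from bisect import bisect_left, bisect_right
from collections import namedtuple

# Hypothetical event record; the real model is not shown in this post.
Event = namedtuple("Event", "name lat lon")

def nearby(events_by_lon, lat, lon, radius_deg):
    """Return events inside a lat/lon box of half-width radius_deg.

    events_by_lon must be sorted by longitude; it stands in for the
    single property the datastore allows a range filter on.
    """
    lons = [e.lon for e in events_by_lon]
    # Step 1: the one allowed inequality filter (a longitude range).
    lo = bisect_left(lons, lon - radius_deg)
    hi = bisect_right(lons, lon + radius_deg)
    candidates = events_by_lon[lo:hi]
    # Step 2: the latitude check has to run in application code.
    return [e for e in candidates if abs(e.lat - lat) <= radius_deg]

events = sorted(
    [
        Event("near", 40.0, -83.0),
        Event("same longitude, far north", 60.0, -83.0),
        Event("far west", 40.0, -120.0),
    ],
    key=lambda e: e.lon,
)
print([e.name for e in nearby(events, 39.8, -84.2, 2.0)])  # ['near']
```

The cost is in step 2: every event in the longitude strip, at any latitude, gets pulled into the app before it can be thrown away.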

In production, though, it doesn't run at all - "Timeout: datastore timeout: operation took too long." Querying from 5000 records - too much for the mighty BigTable, apparently. Dropping the filter on longitude (to do all the filtering in the app, in case inequality filtering is just so poisonous) didn't help, either.

Oh well. I still enjoyed working with GAE at first, and maybe I'll use it again for something with very light data demands. For the Geek Event Aggregator, I do have a server available where I can host in TurboGears - it'll just take a bit of rewriting. Later this week, hopefully.

8 comments:

Etienne said...

Consider having a look at: http://geohash.org/
It lets you encode Lat and Long into a simple string which has the nice property of sorting places close to each other when you do a simple alphabetical sort of the list of strings.
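For readers who have not seen the encoding, here is a minimal sketch of the standard geohash algorithm in plain Python. It is written from the public description of the scheme, not taken from geohash.org's code, and the function name `geohash_encode` is just an illustration.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lon, precision=8):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)

print(geohash_encode(57.64911, 10.40744, precision=2))  # u4
```

Each extra character narrows the cell, so places in the same cell share a prefix, and a prefix match can be written as a range on a single string property.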

And the GAE is funky, yes, but you need to approach things from a different angle. Difficult to forget all the RDBMS habits, but you have to try.

Jay A. said...

Imagine you arrive at a stationery store and the clerk tells you, "OK miss, please wait just a bit while I check against my whole inventory of 5,000 products to find those 2 or 3 that match your request."

Sounds stupid in the real world, doesn't it?

But programmers got so used to this sort of approach because of RDBMS that anything that doesn't fit this model seems deficient.

The bottom line: Refactor your model, re-tag it, think outside the RDBMS box!

Catherine said...

Etienne, thanks for the pointer to geohash. It's a great idea, but not quite what I need this time - there is some correspondence between alphabetical order and geographic location, but if you are near, say, the border between the Cs and the Ds, geohash might view very nearby places as "far away" and vice versa.

Jay,

My app wants to let users search for events taking place in a specific part of the world.

What box am I thinking inside of? Location-specificity is the only reason the user would be consulting an automated search in the first place, instead of sifting through event announcements by hand.

Araf Karsh said...

take a look at the property list and merge joins

http://code.google.com/events/io/2009/sessions/BuildingScalableComplexApps.html

The property list and merge joins will let you think differently from RDBMS concepts.

Robert A. Ficcaglia said...

I am evaluating GAE for a large app and am curious if this was ever resolved or if this is still an issue?

xpmatteo said...

Coming very late to this, but: you could normalize coords to a square of, say, 10km per side. You save that square as a single field in the events table. When I look for events near to me, you take my coordinates and compute the square I'm in, and the adjacent squares so that I don't miss something just because we are on opposite sides of a square's border. Would that have helped?
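A rough sketch of this grid idea in plain Python. The names (`cell_of`, `cells_to_search`, `find_near`) and the 0.1 degree cell size are assumptions made for illustration. A cell that size is about 11 km tall, but it gets narrower east to west as you move away from the equator, so these are not true 10 km squares. The dictionary stands in for an equality filter on a single stored "cell" property.

```python
import math

CELL_DEG = 0.1  # assumed cell size in degrees, not a real 10 km square

def cell_of(lat, lon, size=CELL_DEG):
    """Return the (row, col) grid cell that contains a point."""
    return (math.floor(lat / size), math.floor(lon / size))

def cell_key(cell):
    """Turn a cell into the string that would be stored with each event."""
    return "%d:%d" % cell

def cells_to_search(lat, lon, size=CELL_DEG):
    """Keys for the user's own cell plus its 8 neighbours."""
    row, col = cell_of(lat, lon, size)
    return [cell_key((row + dr, col + dc))
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)]

# Stand-in for the datastore: cell key -> event names in that cell.
index = {}

def add_event(name, lat, lon):
    index.setdefault(cell_key(cell_of(lat, lon)), []).append(name)

def find_near(lat, lon):
    # One equality lookup per cell; no inequality filter is needed.
    return [name for key in cells_to_search(lat, lon)
            for name in index.get(key, [])]

add_event("same cell", 39.76, -84.19)
add_event("next cell north", 39.84, -84.19)
add_event("far away", 45.0, -84.19)
print(find_near(39.76, -84.19))  # ['same cell', 'next cell north']
```

The sketch ignores wraparound at the 180 degree meridian and at the poles.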

Bjorn Roche said...

I am currently deciding whether to use GAE for a new project and found a link to this. At first this seems like a "major fail" but understanding how the tables work, I can see how this would actually improve performance overall. The trick is figuring out how to solve this problem in the first place. I admit it's not as simple as SQL, but it doesn't scare me away from GAE either.

I assume you tried performing 2 queries and taking the intersection of the results. That may not go any faster than your current method.
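In plain Python, the two-query idea might look like the sketch below. `ids_in_range` is a hypothetical stand-in for one datastore query with a single inequality filter, and the sample events are made up.

```python
def ids_in_range(events, field, low, high):
    """Stand-in for one query with a single inequality filter."""
    return {eid for eid, e in events.items() if low <= e[field] <= high}

events = {
    1: {"lat": 39.8, "lon": -84.2},
    2: {"lat": 60.0, "lon": -84.0},
    3: {"lat": 39.9, "lon": -120.0},
}

by_lat = ids_in_range(events, "lat", 38.0, 42.0)    # one query on latitude
by_lon = ids_in_range(events, "lon", -86.0, -82.0)  # one query on longitude
print(sorted(by_lat & by_lon))  # [1]
```

Both queries still return whole strips of the map, which is why this may be no faster than filtering one strip in the app.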

xpmatteo's solution would work fine, but if you wanted searches of varying distances, that would get tricky. By switching to geohash-like structures and getting clever, I suspect you could make that work, but the math would get complex.

I would use a hybrid approach where instead of tiling, I would break the geography into numbered longitudinal "bands". The band size would probably have to be chosen to approximate the smallest distance the user would enter. Thus, I would calculate which band the user's location belonged to and search for all events in that band. But I would also add an inequality to that search: latitude! Obviously, you'll want to search adjacent bands depending on search radius, band size, and how close the user is to the edges.
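A small plain-Python sketch of the band idea. The 0.5 degree band width, the function names, and the sample events are assumptions for illustration. The dictionary lookup stands in for an equality filter on a stored band number, and the latitude test is the single inequality.

```python
import math

BAND_DEG = 0.5  # assumed band width in degrees of longitude

def band_of(lon, width=BAND_DEG):
    """Number of the longitudinal band that contains lon."""
    return math.floor(lon / width)

def bands_for(lon, radius_deg, width=BAND_DEG):
    """Every band the search radius touches, west to east."""
    first = band_of(lon - radius_deg, width)
    last = band_of(lon + radius_deg, width)
    return range(first, last + 1)

def search(events_by_band, lat, lon, radius_deg):
    hits = []
    for band in bands_for(lon, radius_deg):
        # Equality on the band number (one lookup per band) ...
        for e in events_by_band.get(band, []):
            # ... plus the one inequality, on latitude.
            if lat - radius_deg <= e["lat"] <= lat + radius_deg:
                hits.append(e["name"])
    return hits

events_by_band = {}
for name, lat, lon in [
    ("near", 39.8, -84.2),
    ("same band, far north", 60.0, -84.3),
    ("adjacent band", 39.9, -83.9),
    ("far west", 39.8, -120.0),
]:
    events_by_band.setdefault(band_of(lon), []).append(
        {"name": name, "lat": lat, "lon": lon})

print(search(events_by_band, 39.8, -84.1, 0.3))  # ['near', 'adjacent band']
```

A band can reach slightly past the search radius in longitude, so a final exact distance check in the app would still be needed to trim the edges.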

Phil Bair said...

I'm seeing a lot of comments here about thinking outside the RDBMS box. That sounds like a fresh and trendy thing to say, but why is it so important to think outside the RDBMS box??? Relational database models are still just as valid today as they were 20 years ago. They have stood the test of time, and they still offer the best flexibility. In this discussion, thinking outside the box is just a euphemism for a workaround. BigTable is a toy compared to a powerful relational database. It's time the "outside the box" thinkers faced up to that fact. I love new technology, but only when it proves superior to the established technology. New just because it's new is foolish.