Tuesday, August 26, 2014

%sql: To Pandas and Back

A Pandas DataFrame has a nice to_sql(table_name, sqlalchemy_engine) method that saves itself to a database.

The only trouble is that coming up with the SQLAlchemy Engine object is a little bit of a pain, and if you're using the IPython %sql magic, your %sql session already has an SQLAlchemy engine anyway. So I created a bogus PERSIST pseudo-SQL command that simply calls to_sql with the open database connection:

%sql PERSIST mydataframe

The result is that your data can make a very convenient round-trip from your database, to Pandas and whatever transformations you want to apply there, and back to your database:

In [1]: %load_ext sql

In [2]: %sql postgresql://@localhost/
Out[2]: u'Connected: @'

In [3]: ohio = %sql select * from cities_of_ohio;
246 rows affected.

In [4]: df = ohio.DataFrame()

In [5]: montgomery = df[df['county']=='Montgomery County']

In [6]: %sql PERSIST montgomery
Out[6]: u'Persisted montgomery'

In [7]: %sql SELECT * FROM montgomery
11 rows affected.
[(27L, u'Brookville', u'5,884', u'Montgomery County'),
 (54L, u'Dayton', u'141,527', u'Montgomery County'),
 (66L, u'Englewood', u'13,465', u'Montgomery County'),
 (81L, u'Germantown', u'6,215', u'Montgomery County'),
 (130L, u'Miamisburg', u'20,181', u'Montgomery County'),
 (136L, u'Moraine', u'6,307', u'Montgomery County'),
 (157L, u'Oakwood', u'9,202', u'Montgomery County'),
 (180L, u'Riverside', u'25,201', u'Montgomery County'),
 (210L, u'Trotwood', u'24,431', u'Montgomery County'),
 (220L, u'Vandalia', u'15,246', u'Montgomery County'),
 (230L, u'West Carrollton', u'13,143', u'Montgomery County')]


erlebnishengst said...

Really cool idea. Useful for many cases, when accessing a datawarehouse from iPython. Thanks for sharing it.

Just tested with our 12GB db, which holds weather and environmental data. Worked perfectly.

Daniel Black said...

Hi Catherine. I came here by having stumbled ipython-sql from...somewhere. I don't know; that tab's been open for weeks. Anyway, looks great. I'm wondering how you feel Yhat's db.py would compare, specifically for use in IPython notebooks. They just announced it today, so this is a bit premature, but wondered all the same.

Additionally, by pure serendipity, I only after looking at ipython-sql and came here realized I'd seen you talk at PyOhio 2014. My wife attended PyCamp, and then we both showed up for talks that Saturday. Great work all around.

Lastly, I had no idea there was such a thing as TRUCEConf. Now it's long over and the URL is no longer active. Do you anticipate they'll run it again next year?



Ram Kotamaraja said...

Hi Catherine,

I am using your ipython-sql to connect to impala and run queries. It works great with the following url.

%sql impala://aws-impala-url:21050

But, when I try to use the same statement in kerberos environment, it not working and getting below errors. Is kerberos supported in the library? If it is supported, what is the right url or right way to use it?

TTransportException: TSocket read 0 bytes
ERROR:impala.hiveserver2:Failed to open transport

StatementError: (thrift.transport.TTransport.TTransportException) TSocket read 0 bytes [SQL: u'select version()']

I appreciate your input.