Friday, November 13, 2015

Wanted: RDBMS superpower summary for app developers

At last night's WWCode Cincinnati panel, I recommended that developers talk to their DBA about what advanced capabilities their RDBMS can offer, so that they don't end up reimplementing functionality in the app that is already available (better and more efficiently) in the database itself. Devs can waste a lot of effort by thinking of databases as dumb, inert data boxes.
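
To make "reimplementing functionality in the app" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `orders` table and its data are invented purely for illustration. It shows the same totals computed twice (once by looping in the app, once by asking the database to aggregate) and a CHECK constraint doing validation that might otherwise be hand-written in app code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table; the CHECK constraint keeps validation in the database.
conn.executescript("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer TEXT NOT NULL,
        amount NUMERIC NOT NULL CHECK (amount > 0)
    );
    INSERT INTO orders (customer, amount) VALUES
        ('ada', 30), ('ada', 70), ('grace', 25);
""")

# App-side approach: fetch every row and add them up in Python.
totals_in_app = {}
for customer, amount in conn.execute("SELECT customer, amount FROM orders"):
    totals_in_app[customer] = totals_in_app.get(customer, 0) + amount

# Database-side approach: the RDBMS aggregates and returns only the answer.
totals_in_db = dict(
    conn.execute("SELECT customer, SUM(amount) FROM orders GROUP BY customer")
)

# The database rejects bad data on its own; no app-side validation needed.
try:
    conn.execute("INSERT INTO orders (customer, amount) VALUES ('ada', -5)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(totals_in_app, totals_in_db, rejected)
```

With three rows the difference is invisible, but with millions of rows the second query ships a handful of totals over the wire instead of the whole table.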

I was asked an excellent question: "Where can a dev quickly familiarize herself with what those capabilities are?" My answer was, "Um."

Do not say they should read the docs. That is a "let them eat cake" answer. The PostgreSQL docs are over 2900 pages. That's not what they need.

Suggestions, folks? Python developers have built great summary sites, like the Hitchhiker's Guide to Python. What are the equivalents in the database world? Do they exist? Do we need to write them?

Friday, July 10, 2015

Code Studio rocks; diversity does, too

If you want to quickly get some kids introduced to computer programming concepts, you could do a lot worse than using Code Studio from code.org. That's what I did the last couple weeks - took two hours to lightly shepherd the Dayton YWCA day camp through a programming intro.

It's really well-organized and easy to understand - frankly, it pretty much drives itself. It's based on block-dragging for turtle graphics and/or simple 2D games, all easy and appealing stuff. (They even got their turtle graphics branded as the sisters from Frozen ice-skating!) I didn't need to do much more than stand there and demonstrate that programmers actually exist in the flesh, and occasionally nudge a student over a bump. Though, by pair programming, they did most of the nudging themselves.

Here's most of my awesome class. Sorry I'm as bad at photography as at CSS.

Hey - we got demographics, huh? Right - if you announce that you're teaching a coding class through your usual geeky circles, they spread the word among their circles and recruit you a class that looks pretty much like the industry already looks. And if you seek a venue through your geeky circles, the usual suspects will step up to host. In badly segregated Dayton, that means "as far from the colored parts of town as possible." That's less than inviting to the people who don't live there.

But if you partner with groups that already have connections in diverse communities - like the YWCA, which makes anti-racism one of its keystones - getting some fresh faces can be pretty easy! And there are venues available outside the bleached-white exurbs you're used to - you just need to think to look.

Another benefit of Code Studio is that it's entirely web-based, so you don't need to restrict your demographics to "kids whose parents can afford to get them laptops". The public library's computer classroom did the job with flying colors.

Seriously, this was about the easiest outreach I've ever done. I'm working on the follow-up, but I think I'll be able to find further lazy options. Quite likely it will leverage Codecademy. So, what's your excuse for not doing it in your city?

Now, in other news: You are running out of time to register for PyOhio, a fantastic, friendly, free, all-levels Python conference, and my pride and joy. The schedule is amazing this year, and for better or for worse, I'm keynoting. So please come and add to my terror.

Friday, November 14, 2014

rdbms-subsetter

I've never had a tool I really liked that would extract a chunk of a large production database for testing purposes while respecting the database's foreign keys. This past week I finally got to write one: rdbms-subsetter.

rdbms-subsetter postgresql://user:passwd@host/source_db postgresql://user:passwd@host/excerpted_db 0.001

Getting it to respect referential integrity "upward" - guaranteeing every needed parent record would be included for each child row - took less than a day. Trying to get it to also guarantee referential integrity "downward" - including all child records for each parent record - was a Quixotic idea that had me tilting at windmills for days. It's important, because parent records without child records are often useless or illogical. Yet trying to pull them all in led to an endlessly propagating process - percolation, in chemical engineering terms - that threatened to make every test database a complete (but extremely slow) clone of production. After all, if every row in parent table P1 demands rows in child tables C1, C2, and C3, and those child rows demand new rows in parent tables P2 and P3, which demand more rows in C1, C2, and C3, which demand more rows in their parent tables... I felt like I was trying to cut a little sweater out of a big sweater without snipping any yarns.

So I can't guarantee child records - instead, the final process prioritizes creating records that will fill out the empty child slots in existing parent records. But there will almost inevitably be some child slots left open when the program is done.
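
The "upward" half of the problem can be sketched in a few lines. This is not rdbms-subsetter's actual code: it is a simplified illustration using sqlite3, with made-up `parent` and `child` tables and a hypothetical `subset_upward` function. It samples child rows, then pulls in exactly the parent rows those children reference. It makes no attempt at the "downward" guarantee.

```python
import random
import sqlite3

random.seed(0)  # repeatable sample for the illustration

# Hypothetical source database: 100 parents, 1000 children.
src = sqlite3.connect(":memory:")
src.executescript("""
    CREATE TABLE parent (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE child (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER NOT NULL REFERENCES parent(id),
        note TEXT
    );
""")
src.executemany(
    "INSERT INTO parent VALUES (?, ?)",
    [(i, "p%d" % i) for i in range(1, 101)],
)
src.executemany(
    "INSERT INTO child VALUES (?, ?, ?)",
    [(i, random.randint(1, 100), "c%d" % i) for i in range(1, 1001)],
)


def subset_upward(conn, fraction):
    """Sample child rows, then fetch every parent row they point to."""
    all_children = conn.execute(
        "SELECT id, parent_id, note FROM child"
    ).fetchall()
    sample_size = max(1, int(len(all_children) * fraction))
    sampled = random.sample(all_children, sample_size)

    # Upward integrity: collect the parent keys the sample depends on.
    needed_ids = sorted({row[1] for row in sampled})
    placeholders = ",".join("?" * len(needed_ids))
    needed_parents = conn.execute(
        "SELECT id, name FROM parent WHERE id IN (%s)" % placeholders,
        needed_ids,
    ).fetchall()
    return needed_parents, sampled


parents, children = subset_upward(src, 0.01)
print(len(children), "children need", len(parents), "parents")
```

Each sampled child drags in at most one parent per foreign key, so this direction stays bounded. Going the other way, every included parent would drag in all of its children, each of which needs its own parents, which is where the percolation starts.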

I've been using it against one multi-GB, highly interconnected production data warehouse, so it's had some testing, but your bug reports are welcome.

Like virtually everything else I do, this project depends utterly on SQLAlchemy.

I developed this for use at 18F, and my choice of a workplace where everything defaults to open was wonderfully validated when I asked about the procedure for releasing my 18F work to PyPI. The procedure is - and I quote -

Just go for it.

Tuesday, August 26, 2014

%sql: To Pandas and Back

A Pandas DataFrame has a nice to_sql(table_name, sqlalchemy_engine) method that saves itself to a database.

The only trouble is that coming up with the SQLAlchemy Engine object is a little bit of a pain, and if you're using the IPython %sql magic, your %sql session already has an SQLAlchemy engine anyway. So I created a bogus PERSIST pseudo-SQL command that simply calls to_sql with the open database connection:

%sql PERSIST mydataframe

The result is that your data can make a very convenient round-trip from your database, to Pandas and whatever transformations you want to apply there, and back to your database:


In [1]: %load_ext sql

In [2]: %sql postgresql://@localhost/
Out[2]: u'Connected: @'

In [3]: ohio = %sql select * from cities_of_ohio;
246 rows affected.

In [4]: df = ohio.DataFrame()

In [5]: montgomery = df[df['county']=='Montgomery County']

In [6]: %sql PERSIST montgomery
Out[6]: u'Persisted montgomery'

In [7]: %sql SELECT * FROM montgomery
11 rows affected.
Out[7]: 
[(27L, u'Brookville', u'5,884', u'Montgomery County'),
 (54L, u'Dayton', u'141,527', u'Montgomery County'),
 (66L, u'Englewood', u'13,465', u'Montgomery County'),
 (81L, u'Germantown', u'6,215', u'Montgomery County'),
 (130L, u'Miamisburg', u'20,181', u'Montgomery County'),
 (136L, u'Moraine', u'6,307', u'Montgomery County'),
 (157L, u'Oakwood', u'9,202', u'Montgomery County'),
 (180L, u'Riverside', u'25,201', u'Montgomery County'),
 (210L, u'Trotwood', u'24,431', u'Montgomery County'),
 (220L, u'Vandalia', u'15,246', u'Montgomery County'),
 (230L, u'West Carrollton', u'13,143', u'Montgomery County')]

Monday, July 28, 2014

auto-generate SQLAlchemy models

PyOhio gave my lightning talk on ddlgenerator a warm reception, and Brandon Lorenz got me thinking, and PyOhio sprints filled me with py-drenaline, and now ddlgenerator can inspect your data and spit out SQLAlchemy model definitions for you:

$ cat merovingians.yaml 
-
  name: Clovis I
  reign:
    from: 486
    to: 511
-
  name: Childebert I
  reign:
    from: 511
    to: 558
$ ddlgenerator --inserts sqlalchemy merovingians.yaml 

from sqlalchemy import create_engine, Column, Integer, MetaData, Table, Unicode
engine = create_engine(r'sqlite:///:memory:')
metadata = MetaData(bind=engine)

merovingians = Table('merovingians', metadata, 
  Column('name', Unicode(length=12), nullable=False), 
  Column('reign_from', Integer(), nullable=False), 
  Column('reign_to', Integer(), nullable=False), 
  schema=None)

metadata.create_all()
conn = engine.connect()
inserter = merovingians.insert()
conn.execute(inserter, **{'name': 'Clovis I', 'reign_from': 486, 'reign_to': 511})
conn.execute(inserter, **{'name': 'Childebert I', 'reign_from': 511, 'reign_to': 558})
conn.connection.commit()

Brandon's working on a pull request to provide similar functionality for Django models!

Tuesday, July 01, 2014

18F

Yesterday was my first day at 18F!

What is 18F? We're a small, little-known government organization that works outside the usual channels to accomplish special projects. It involves black outfits and a lot of martial arts.

Kidding! Sort of. 18F is a new agency within the GSA that does citizen-focused work for other parts of the U.S. Government, working small, quick projects to make information more accessible. We're using all the tricks: small teams, agile development, rapid iteration, open-source software, test-first, continuous integration. We do our work in the open.

Sure, this is old hat to you, faithful blog readers. But bringing it into government IT work is what makes it exciting. We're hoping that the techniques we use will ripple out beyond the immediate projects we work on, popularizing them throughout government IT and helping efficiency and responsiveness throughout. This is a chance to put all the techniques I've learned from you to work for all of us. Who wouldn't love to get paid to work for the common good?

Obviously, this is still my personal blog, so nothing I say about 18F counts as official information. Just take it as my usual enthusiastic babbling.

Friday, May 23, 2014

ddlgenerator

I've had it on GitHub for a while, but I finally released ddlgenerator to PyPI.

I've been frustrated for years that there was no good open-source way to set up RDBMS tables from flat data files. Sure, you could import the data - after setting up the DDL by hand. ddlgenerator handles that; in fact, you can go from zero, setting up and populating a table in a single line. Nothing up my sleeve:

$ psql -c "SELECT * FROM knights"
ERROR:  relation "knights" does not exist
LINE 1: SELECT * FROM knights
                      ^
$ ddlgenerator --inserts postgresql knights.yaml | psql
CREATE TABLE
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
$ psql -c "SELECT * FROM knights"
    name    |         dob         |   kg    | brave 
------------+---------------------+---------+-------
 Lancelot   | 0471-01-09 00:00:00 | 82.0000 | t
 Gawain     |                     | 69.2000 | t
 Robin      | 0471-01-09 00:00:00 |         | f
 Reepacheep |                     |  0.0691 | t
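
For a feel of what "setting up the DDL" from data involves, here is a toy sketch of type inference. It is not how ddlgenerator is implemented: the `sql_type` and `create_table_ddl` functions are hypothetical, the type rules are deliberately crude, and the sample rows are only a guess at data that would produce the table above.

```python
from datetime import datetime


def sql_type(values):
    """Pick a column type broad enough for every non-null value seen."""
    present = [v for v in values if v is not None]
    if not present:
        return "TEXT"
    if all(isinstance(v, bool) for v in present):
        return "BOOLEAN"
    # bool is a subclass of int in Python, so exclude it from numeric checks.
    numeric = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    if len(numeric) == len(present):
        if all(isinstance(v, int) for v in present):
            return "INTEGER"
        return "DECIMAL"
    if all(isinstance(v, datetime) for v in present):
        return "TIMESTAMP"
    # Fall back to text sized to the longest value.
    return "VARCHAR({})".format(max(len(str(v)) for v in present))


def create_table_ddl(table_name, rows):
    """Build a CREATE TABLE statement from a list of dicts."""
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    definitions = []
    for column in columns:
        values = [row.get(column) for row in rows]
        definition = "  {} {}".format(column, sql_type(values))
        # A column is NOT NULL only if every row supplied a value for it.
        if all(v is not None for v in values):
            definition += " NOT NULL"
        definitions.append(definition)
    return "CREATE TABLE {} (\n{}\n);".format(
        table_name, ",\n".join(definitions)
    )


# Assumed sample rows, mirroring the query output above.
knights = [
    {"name": "Lancelot", "dob": datetime(471, 1, 9), "kg": 82.0, "brave": True},
    {"name": "Gawain", "kg": 69.2, "brave": True},
    {"name": "Robin", "dob": datetime(471, 1, 9), "brave": False},
    {"name": "Reepacheep", "kg": 0.0691, "brave": True},
]

ddl = create_table_ddl("knights", knights)
print(ddl)
```

Even this toy version has to decide about nullability, column sizing, and Python's bool-is-an-int quirk; handling real data across several SQL dialects multiplies those decisions.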

This is a fairly complex tool so I'm sure you'll be using the bug tracker. But I hope you'll enjoy it nonetheless!