I've never had a tool I really liked that would extract a chunk of a large production database for testing purposes while respecting the database's foreign keys. This past week I finally got to write one: rdbms-subsetter.
rdbms-subsetter postgresql://user:passwd@host/source_db postgresql://user:passwd@host/excerpted_db 0.001
Getting it to respect referential integrity "upward" - guaranteeing every needed parent record would be included for each child row - took less than a day. Trying to get it to also guarantee referential integrity "downward" - including all child records for each parent record - was a Quixotic idea that had me tilting at windmills for days. It's important, because parent records without child records are often useless or illogical. Yet trying to pull them all in led to an endlessly propagating process - percolation, in chemical engineering terms - that threatened to make every test database a complete (but extremely slow) clone of production. After all, if every row in parent table P1 demands rows in child tables C1, C2, and C3, and those child rows demand new rows in parent tables P2 and P3, which demand more rows in C1, C2, and C3, which demand more rows in their parent tables... I felt like I was trying to cut a little sweater out of a big sweater without snipping any yarns.
So I can't guarantee child records - instead, the final process prioritizes creating records that will fill out the empty child slots in existing parent records. But there will almost inevitably be some child slots left open when the program is done.
I've been using it against one multi-GB, highly interconnected production data warehouse, so it's had some testing, but your bug reports are welcome.
Like virtually everything else I do, this project depends utterly on SQLAlchemy.
I developed this for use at 18F, and my choice of a workplace where everything defaults to open was wonderfully validated when I asked about the procedure for releasing my 18F work to PyPI. The procedure is - and I quote -
Just go for it.