Digging into python memory issues in ckan with heapy
So we had a report about a memory leak when using the CKAN datastore extension, where large queries to the datastore would leak large amounts of memory per request. It wasn't simple to get to the bottom of it; at first I couldn't recreate the leak at all. The test data I was using was the STAR experiment CSV files, which I found when I googled 'Large example csv files'. The reporter, Alice Heaton, had kindly written a script that would recreate the leak. Even with this, I could not recreate the problem until I upped the number of rows fetched by a factor of ten. I suspect that Alice has more data per column, with perhaps large text fields instead of the mainly numeric data of the STAR experiment data I was using.

Once I could reliably recreate the problem, I ended up poking around using heapy, which I've used previously to track down similar problems, and inserted some code to set up heapy and an ipdb breakpoint:

    from guppy import hpy
    hp = hpy()
    heap ...
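Filled out a little, that instrumentation looks roughly like this (a sketch only; it assumes guppy and ipdb are installed, and where exactly it sits in the datastore code is omitted):

    from guppy import hpy
    import ipdb

    hp = hpy()
    hp.setrelheap()    # only count objects allocated from this point on

    # ... run the suspect code path, e.g. the large datastore query ...

    heap = hp.heap()   # snapshot of everything allocated since setrelheap()
    print(heap)        # live objects grouped by type, biggest consumers first
    ipdb.set_trace()   # drop into the debugger to poke at `heap` interactively

The heap summary groups live objects by type and size, which usually points fairly directly at whatever is being held on to.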
For a while, we kept running into issues where paster db clean would hang, and I spent some time trying to figure out what the hell was going on and what was wrong with the session.
As it turned out, there was nothing really wrong with the session; it was just that we were handling it badly for command line functions. There were basically two problems.
So we're not actually connected to the database yet; SQLAlchemy will connect when we issue a query to it.
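You can see that lazy behaviour with plain SQLAlchemy (a minimal sketch with a made-up connection URL, not CKAN's actual setup):

    from sqlalchemy import create_engine, text
    from sqlalchemy.orm import sessionmaker, scoped_session

    # Neither of these opens a database connection: the engine just configures
    # a connection pool, and the scoped session registry sits empty until used.
    engine = create_engine('postgresql://user:pass@localhost/ckan_test')
    Session = scoped_session(sessionmaker(bind=engine))

    session = Session()                 # still no connection
    session.execute(text('SELECT 1'))   # the first query is what actually connects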
We'll run into our second problem next.
Our Session registry currently has an active session, which our dataset object is part of. Next we run get_site_user.
get_site_user not only commits, but removes the session and closes the session's connection. Our registry is now closed.
If we now try to use any ORM-related features of our object, such as following a relation, we'll get a DetachedInstanceError. In the case of ckanext-spatial this is caused by the previous_object becoming detached when get_site_user is called after it. When the previous_object is used later, probably here as it tries to get the HarvestJob object, it'll fail. Accessing other things like previous_object.current will work fine, as that doesn't have to go to the database to fetch a related object.
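Here's a stripped-down way to reproduce the same failure outside CKAN (the models are stand-ins, not the real ckanext-harvest classes):

    from sqlalchemy import create_engine, Column, Integer, String, ForeignKey
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy.orm import relationship, sessionmaker, scoped_session
    from sqlalchemy.orm.exc import DetachedInstanceError

    Base = declarative_base()

    class HarvestJob(Base):
        __tablename__ = 'harvest_job'
        id = Column(Integer, primary_key=True)

    class HarvestObject(Base):
        __tablename__ = 'harvest_object'
        id = Column(Integer, primary_key=True)
        current = Column(String)
        job_id = Column(Integer, ForeignKey('harvest_job.id'))
        job = relationship(HarvestJob)

    engine = create_engine('sqlite://')
    Base.metadata.create_all(engine)
    Session = scoped_session(sessionmaker(bind=engine))

    # put an object in the database, then load it in a fresh session
    session = Session()
    session.add(HarvestObject(id=1, current='yes', job=HarvestJob(id=1)))
    session.commit()
    Session.remove()

    session = Session()
    previous_object = session.query(HarvestObject).filter_by(id=1).one()

    Session.remove()    # roughly what get_site_user was doing behind our back

    print(previous_object.current)   # fine: a plain column that's already loaded
    try:
        previous_object.job          # lazy-loads a relation, so it needs the DB
    except DetachedInstanceError:
        print('DetachedInstanceError: previous_object is no longer attached')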
So we ended up fixing the first problem by closing the transactions opened by get_site_user and get_system_info, and we could do something like a model.Session.remove() before the drop_all in clean_db (we do a similar thing in delete_all()).
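In terms of the sketch above (same Session, Base and engine names), the clean_db part is really just a question of ordering:

    # release the scoped session's current session -- and with it its connection
    # and any open transaction -- before dropping tables, so drop_all isn't left
    # waiting on locks that our own session is still holding
    Session.remove()
    Base.metadata.drop_all(engine)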
The second problem was fixed by changing get_site_user to only commit and not remove the session.
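As a pattern (a sketch, not the actual diff), the change is the difference between these two:

    # before: commit, then tear the session down -- this is what detached
    # previous_object in the example above
    Session.commit()
    Session.remove()

    # after: commit only; tearing the session down is left to the caller
    Session.commit()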
These two problems were fixed in 2.1.3 and 2.2.1 onwards, so hopefully you'll never run into them.
This post is basically ckan/ckan#1714