Digging into python memory issues in ckan with heapy
So we had a report about a memory leak when using the ckan datastore extension, where large queries to the datastore would leak large amounts of memory per request. It wasn't simple to get to the bottom of it, at first I couldn't recreate it the leak at all. The test data I was using was the STAR experiment csv files, which I found when I googled 'Large example csv files'. The reporter Alice Heaton, had kindly written a script that would recreate the leak. Even with this, I could not recreate the problem, until I upped the number of rows fetched by a factor of ten. I suspect that Alice has more data per column with perhaps large text fields instead of the mainly numeric data of the STAR experiment data I was using. Once I could reliably recreate the problem, I ended up poking around using heapy , which I've used previously to track down similar problems and inserted some code to setup heapy and an ipdb breakpoint from guppy import hpy hp = hpy() heap ...
I'm assuming contexts only contain model, session and user. I get why context exists, it basically comes down to whether
f
,f
can easily get the requet/session/user by calling the flask g object/pylons g thing to get the user without having to muddy up all the other previous function callers definitions. Your compromise is that you have a magical 'global' object (say model.Session) but you can access it anywhere in the code safely without thinking about it too much.The problem I think we have, is that ckan straddles both sides and we end up with a worse, really confusing, solution. Currently in ckan, we use the scoped_session, which is a threadlocal based setup for sqlalchemy, So when you pass ckan.model from your extension into an action function, unless theres is some additional voodoo(possibly vdm, i haven't looked at that code), then ckan.model.Session and context['session'] are currently always the same. This is incredibly confusing, especially to new ckan developers.
So if context['session'] is the current sqlalchemy session, that we're passing around, then why are we passing context['model'] around? Well it's all to do with the model code. Take a look at model/package.py and you'll see it using meta.Session.query all over the shop. For example,
In much of the logic layer you'll see that all the action functions begin
If we didn't include this line and imported ckan.model in the logic layer and ran model.Package.get('blah'), we'd be referring to the threadlocal sqlalchemy session(the "global" one), not the one we passed through context['model'](even if they are they both refer to the same thing)
So to get round this we pass context['model'] into the logic functions and generally have a line "model = context['model']". This ensures we'll be referring to the session object that we passed in as a parameter through context throughout the code.
So context['model'] is a way of avoiding references to the threadlocal/"global" model.Session object for the model code.
All callees in ckan extensions usually end up constructing a context and passing it in, many of the extensions I've written contain.
In reality it's better to do
This is because get_action() will construct a default context for you if you don't specify it. Also because, you as an extension writer, do not care about context. (Also in the future it means we can refactor/remove it more easily without breaking your code!)
Anyway, we've muddied up our code because I never really understood it, but it's how our extensions are written, so everyone copies our code. I personally think it would be better to pass the session to the model code.
This way, you would no longer need to pass context['model'] because the action functions would end up looking like
Wait, wait, or even better
This would clean up tests as well, the way we supply user into the context is a pain in the backside. When you're writing a test using the factories and you're passing a user in as an argument, but you might have been given a user_dict by another action function in a test, so what do you use? the username or the user_dict? The answer is we have magic which takes a stab at figuring it out so the context['user'] is setup properly.
The problem is even with context, we're still using all the pylons g,c and whatever else everywhere anyway. So we have the problem of threadlocal objects everywhere in the code, but we still have to pass our session/user everywhere, (but not the request, we use pylons threadlocal request for that). Hence why I say contexts in their current state are pointless.
Additionally, how do we supply a validation schema? Currently, this is through the context, but this isn't threadlocal data I hear you say? Well, it'd also be wrong to pass the schema through the data_dict, because the schema will be validating the data_dict right? So unless there is some extra code to pop the schema from the data_dict, it was easier just to jam it in the context.
So some of our options could be.
get_current_username
or something which behind the scenes gets the username from pylons.c.user or whatever(if we ever change to something else)get_current_username
as it can be mocked out in tests and users of our code won't need to think of this crap and just calltoolkit.get_current_username
in their code, and I also like just letting them use model.Session because they won't sit their thinking 'wtf is this context thing' whenever they want to call an action function in their extension, but I haven't really thought any of this through fully.Perhaps the best way is to try and change the action functions so session and user are just passed in as normal parameters but have them default to model.Session?
TL;DR I'm sure I'll change my mind by next month, but generally, we should change context so that it's not baffling to the user. I think by hiding from the user entirely, but we could change it to a namedtuple and documenting what the hell it actually is, or just making them non confusing normal parameters. I'm sure in a years time, the state of this will not have changed.
This post is just from ckan/ideas-and-roadmap#53