wtf is this context thing in ckan?

If you've taken a look at the ckan source code, you'll have come across 'context' as the first parameter in many of the functions. It basically contains all threadlocal information required for a function to execute. It's taken me far too long to understand why they exist and I currently think contexts in their current state are pointless.

I'm assuming contexts only contain model, session and user. I get why context exists, it basically comes down to whether
  • you prefer passing parameters for the request/session/user around into functions, so that the parameters define exactly what the function needs to execute. If we do this though, these are 'different' to other parameters which is why they've been seperated into context/data_dict. Your functions do not have to refer to some magical global object to execute.
  • Or whether you are happy using threadlocals, so that if your logic function, calls another function that calls another, that calls ..., that calls function f, f can easily get the requet/session/user by calling the flask g object/pylons g thing to get the user without having to muddy up all the other previous function callers definitions. Your compromise is that you have a magical 'global' object (say model.Session) but you can access it anywhere in the code safely without thinking about it too much.
There is no right answer, the guy who wrote flask wrote this a blog post detailing his hate for thread locals before he wrote flask, where the decision was made to use them, I know that David R is not a fan of them

The problem I think we have, is that ckan straddles both sides and we end up with a worse, really confusing, solution. Currently in ckan, we use the scoped_session, which is a threadlocal based setup for sqlalchemy, So when you pass ckan.model from your extension into an action function, unless theres is some additional voodoo(possibly vdm, i haven't looked at that code), then ckan.model.Session and context['session'] are currently always the same. This is incredibly confusing, especially to new ckan developers.

So if context['session'] is the current sqlalchemy session, that we're passing around, then why are we passing context['model'] around? Well it's all to do with the model code. Take a look at model/package.py and you'll see it using meta.Session.query all over the shop. For example,
 # ckan/model/package.py
 import meta  
 class Package(...):  
 ...   
   @classmethod  
   def get(cls, reference):  
     '''Returns a package object referenced by its id or name.'''  
     query = meta.Session.query(cls).filter(cls.id==reference)  
     pkg = query.first()  
     if pkg == None:  
       pkg = cls.by_name(reference)  
     return pkg  

In much of the logic layer you'll see that all the action functions begin
 # ckan/logic/action/get.py
def package_show(context, data_dict):  
   model = context['model']  
   session = context['session']
   pkg = model.Package.get(name_or_id)

If we didn't include this line and imported ckan.model in the logic layer and ran model.Package.get('blah'), we'd be referring to the threadlocal sqlalchemy session(the "global" one), not the one we passed through context['model'](even if they are they both refer to the same thing)

So to get round this we pass context['model'] into the logic functions and generally have a line "model = context['model']". This ensures we'll be referring to the session object that we passed in as a parameter through context throughout the code.

So context['model'] is a way of avoiding references to the threadlocal/"global" model.Session object for the model code.

All callees in ckan extensions usually end up constructing a context and passing it in, many of the extensions I've written contain.
 def some_method_in_a_controller(self):  
   some_stuff()  
   ...  
   context = {'model': model, 'session': model.Session', 'user': c.user }  
   data_dict = { 'some': stuff }  
   toolkit.get_action('some_action')(context, data_dict)  
In reality it's better to do
 def some_method_in_a_controller(self):  
   some_stuff()  
   ...  
   data_dict = { 'some': stuff }  
   toolkit.get_action('some_action')(data_dict=data_dict)  
This is because get_action() will construct a default context for you if you don't specify it. Also because, you as an extension writer, do not care about context. (Also in the future it means we can refactor/remove it more easily without breaking your code!)

Anyway, we've muddied up our code because I never really understood it, but it's how our extensions are written, so everyone copies our code. I personally think it would be better to pass the session to the model code.

 class Package(...):  
 ...   
   @classmethod  
   def get(cls, reference, session):  
     '''Returns a package object referenced by its id or name.'''  
     query = session.query(cls).filter(cls.id==reference)  
     pkg = query.first()  
     if pkg == None:  
       pkg = cls.by_name(reference)  
     return pkg  

This way, you would no longer need to pass context['model'] because the action functions would end up looking like
 def package_show(context, data_dict):  
   user = context['user'] # if i'm using it
   model.Package.get('blah', session=context['session'])  

Wait, wait, or even better
 def package_show(session, user, schema, dataset_id, some_real_parameters_so_I_know_the_parameters_without_looking_at_the_schema):  
   model.Package.get('blah', session=session)  

This would clean up tests as well, the way we supply user into the context is a pain in the backside. When you're writing a test using the factories and you're passing a user in as an argument, but you might have been given a user_dict by another action function in a test, so what do you use? the username or the user_dict? The answer is we have magic which takes a stab at figuring it out so the context['user'] is setup properly.

The problem is even with context, we're still using all the pylons g,c and whatever else everywhere anyway. So we have the problem of threadlocal objects everywhere in the code, but we still have to pass our session/user everywhere, (but not the request, we use pylons threadlocal request for that). Hence why I say contexts in their current state are pointless.

Additionally, how do we supply a validation schema? Currently, this is through the context, but this isn't threadlocal data I hear you say? Well, it'd also be wrong to pass the schema through the data_dict, because the schema will be validating the data_dict right? So unless there is some extra code to pop the schema from the data_dict, it was easier just to jam it in the context.

So some of our options could be.
  • make a context namedtuple, but change it's name or properly explain what the hell this thing is in the documentation
  • think about whether we want to include request info into the context, i know some actions have context['for_view'] for some reason or another.
  • get rid of context. This means we'd have to have our own ckan.g threadlocal thing (to move off pylons one and onto flask or whatever) that we can place the username/request in. but eww this is gross
  • get rid of context model, context session. and just let people use ckan.model and model.session, I end up having to import both of these into extensions anyway and people won't have to think about context anymore.
  • get rid of context['user'] by having helper functions get_current_username or something which behind the scenes gets the username from pylons.c.user or whatever(if we ever change to something else)
  • model and session should be optional parameters where their default arguments are ckan.model and can.model.session so the caller doesn't have to specify them each time they use an action function.
  • continue to make blog posts like these but never have the time to change anything anyway.
I'm thinking currently that I like helpers get_current_username as it can be mocked out in tests and users of our code won't need to think of this crap and just call toolkit.get_current_username in their code, and I also like just letting them use model.Session because they won't sit their thinking 'wtf is this context thing' whenever they want to call an action function in their extension, but I haven't really thought any of this through fully.

Perhaps the best way is to try and change the action functions so session and user are just passed in as normal parameters but have them default to model.Session?

TL;DR I'm sure I'll change my mind by next month, but generally, we should change context so that it's not baffling to the user. I think by hiding from the user entirely, but we could change it to a namedtuple and documenting what the hell it actually is, or just making them non confusing normal parameters. I'm sure in a years time, the state of this will not have changed.

This post is just from ckan/ideas-and-roadmap#53

Popular posts

Digging into python memory issues in ckan with heapy

Randomising traitor numbers in Trouble in Terrorist Town

OpenMoonstone v0.2 released