Once I could reliably recreate the problem, I ended up poking around using heapy, which I've used previously to track down similar problems, and inserted some code to set up heapy and an ipdb breakpoint before the controller returns the result from the API.
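To make that concrete, here is a minimal sketch of the kind of instrumentation involved, assuming guppy's hpy() and an ipdb breakpoint; exactly where it sits in the controller is an assumption on my part:

    # sketch: set up heapy and drop into ipdb just before the API result is returned
    from guppy import hpy
    import ipdb

    hp = hpy()
    hp.setrelheap()   # only track objects allocated after this point

    # ... later, just before the controller returns the API response ...
    ipdb.set_trace()  # inspect hp.heap() interactively from here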
There are some tutorials hanging around for heapy [1], so I won't repeat the detail, but it looked like the previous responses were hanging around in memory and were the root cause of the problem.
By inspecting those response strings using .byid I could see that the response strings specifically were the problem: from the output below it was clear that they were consuming (in my case) 90-odd percent of the memory currently in use by strings. But I didn't really have a good idea as to why they were not getting garbage collected.
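Continuing that sketch, the inspection itself looks something like this (names are illustrative):

    h = hp.heap()                 # snapshot of everything heapy is tracking
    response_strings = h[0]       # in this case the biggest row was the str objects
    print(response_strings.byid)  # break that row down into individual objects, largest first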
I then tried to find what was referencing the strings; perhaps we were holding an extra reference to the request object somewhere in CKAN. Below are the shortest paths of references that heapy can find for the above strings.
So the weakref in that path ("_RequestConfig__shared_state'].??[ < weakref...x4e56250>") suggested that these objects should have been cleaned up once there was no longer a reference to them, but a manual gc.collect() didn't seem to do anything, and inspecting the RequestConfig shared_state showed it to be a thread.local object. Using .rp shows a tree of references: in this case the str is contained in the dict of a pylons.controllers.util.Response, which is contained in the dict of a pylons.util.PylonsContext, and so on and so forth.
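Roughly, those checks look like this; response_strings is the heapy set from earlier, and .shpaths (shortest reference paths) is my assumption about how the path output above was produced:

    import gc

    gc.collect()                     # a manual collection; in this case it freed nothing
    print(response_strings.shpaths)  # shortest reference paths heapy can find to the strings
    print(response_strings.rp)       # the tree of referrers walked through above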
At this point, I was moaning about my lack of progress with this problem on #ckan on freenode. So I wanted to see whether what we had in memory was exactly the responses that we requested (or whether it was copying one request over and over, etc.).
I ended up tweaking the script a bit, removing the total from the result dict returned by datastore_search. Not exactly the best code I've ever written, but it did mean that when we inspected using .byid, the output would look something like
This way, I could directly see which of the requests made were still in memory, and it was clear that we were dealing with old requests being kept around. I also took a look at the next couple of strings hanging around.
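For concreteness, the requests were driven by something along these lines; the script itself isn't reproduced here, so the URL, resource id, and parameters are entirely illustrative:

    # sketch: hit datastore_search repeatedly, varying a parameter (here the offset)
    # so that each response string is distinguishable in heapy's .byid output
    import requests

    CKAN_URL = 'http://localhost:5000/api/3/action/datastore_search'  # hypothetical
    RESOURCE_ID = 'some-resource-id'                                  # hypothetical

    for offset in range(100000, 100011):
        r = requests.get(CKAN_URL, params={'resource_id': RESOURCE_ID,
                                           'offset': offset,
                                           'limit': 50000})
        r.raise_for_status()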
At this point, the first requests we made no longer appeared in memory: the requests for 100000-100003 were gone.
So by stepping through each request and stopping before the controller returned the response each time, we could see that it maintains a maximum of 10 response strings in memory, with the older requests getting shuffled off: 100004 has been wiped off. At this point, after some discussion on IRC, we thought it might be the fact that we're in debug mode and it's just retaining the last 10 responses and their debugging information, but the problem remained when I set "debug = false" in my ckan development.ini.
The thing is, ten is far too perfect a number: we humans deal with numbers in batches of ten, and if this were a random occurrence it would be unlikely that there just happened to be ten requests hanging around. It was more likely that the ten requests were the result of some config setting somewhere whose default was ten.
I pieced together the weakref and thread.local clues with http://pythonpaste.org/paste-httpserver-threadpool.html and noticed that the default number of threads for the paste HTTP server is 10. I then tested by changing "threadpool_workers = n" in my ckan development.ini, and each time the number of response strings that showed up in heap[0].byid matched the number of worker threads.
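For reference, that setting lives in the paste server section of the ini file; something like this, with the values being illustrative:

    [server:main]
    use = egg:Paste#http
    host = 0.0.0.0
    port = 5000
    # one retained response string per worker thread; the paste default is 10
    threadpool_workers = 10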
The other thing to notice is that if I made 10 new requests to API endpoints that return a much smaller response, the total size reported dropped from 226701936 bytes to 78397616 bytes. But the entire time I had been debugging, I had also been keeping an eye on the process's memory usage at the OS level, which still reported the memory usage as the same as when the larger requests were in memory. This is apparently just the way Python works (see https://groups.google.com/forum/#!topic/celery-users/jVc3I3kPtlw): if you've just built a response that took up 500MB of memory, it stands to reason that you might have to do the same again quite soon, so Python hangs onto the (virtual) memory. But it's not 'leaked'.
The next step was to investigate whether it's OK to clear out the string in something like the __after__() method of the controller (see http://pylons-webframework.readthedocs.org/en/latest/controllers.html#special-methods). I ended up creating some Pylons-specific cleanup middleware (see https://github.com/ckan/ckan/pull/2262). Once the response string has been served, this middleware replaces the response string with a dummy one, so the original response can be garbage collected. The approach was lifted from https://code.google.com/p/modwsgi/wiki/RegisteringCleanupCode. The middleware can be switched on by setting 'ckan.use_pylons_response_cleanup_middleware = true' in your development.ini.
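The general shape of that middleware, following the modwsgi "cleanup code" recipe, is sketched below. This is not the code from the pull request; in particular, the environ key and response attribute in the callback are assumptions for illustration:

    # sketch: run a callback once the WSGI response has been fully served
    class ExecuteOnCompletion(object):
        def __init__(self, application, callback):
            self.application = application
            self.callback = callback

        def __call__(self, environ, start_response):
            try:
                result = self.application(environ, start_response)
            except Exception:
                self.callback(environ)
                raise
            return self._wrap(result, environ)

        def _wrap(self, result, environ):
            try:
                for chunk in result:
                    yield chunk
            finally:
                if hasattr(result, 'close'):
                    result.close()
                self.callback(environ)   # response served (or abandoned): clean up


    def cleanup_pylons_response(environ):
        # replace the big response string held on the thread-local pylons response
        # with a tiny dummy so the original string can be garbage collected
        context = environ.get('pylons.pylons')          # assumed environ key
        if context is not None and getattr(context, 'response', None) is not None:
            context.response._body = 'cleaned up'       # assumed attribute

It would be wired in by wrapping the app, e.g. app = ExecuteOnCompletion(app, cleanup_pylons_response).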
So I did a comparison with and without this enabled. Everyone likes graphs, right?
Notably, the memory usage still grows with this enabled, but it is a bit more stable. I also did some examining of the memory: it has been returned to Python's free lists but has not been released by the Python process (see http://effbot.org/pyfaq/why-doesnt-python-release-the-memory-when-i-delete-a-large-object.htm). So we'll still end up using as much memory as there are threads building up large responses at once.
The other way around it is to use something like gunicorn + gevent. It isn't a fair direct comparison, as this is only using one worker process, but I mainly suggest it because you can set worker processes to restart after serving a maximum number of requests (see http://docs.gunicorn.org/en/develop/configure.html#max-requests), which would release any memory that the process has been allocated but is not using.
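A gunicorn config along those lines might look like this (a Python config file, values illustrative); it would be run with something like gunicorn -c gunicorn_conf.py <wsgi app>:

    # gunicorn_conf.py
    workers = 1                # single worker process, as in the comparison above
    worker_class = 'gevent'    # gevent-based workers
    max_requests = 1000        # recycle the worker after this many requests,
                               # handing accumulated memory back to the OS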
Finally, it was noted that we probably shouldn't be building up the entire response in memory before returning it, which is what happens because we're using json.dumps to build the response string. It would be better if we had an iterator of some form to stream the response back.
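A rough sketch of the streaming idea (not CKAN code): yield the response a record at a time instead of producing one big json.dumps string, so only a small chunk is ever held in memory:

    import json

    def stream_records(records):
        # records can be any iterable, e.g. a database cursor
        yield '{"success": true, "result": {"records": ['
        for i, record in enumerate(records):
            if i:
                yield ','
            yield json.dumps(record)
        yield ']}}'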
If you want to help fix it, take a look at https://github.com/ckan/ideas-and-roadmap/issues/128
Next time I look into a Python memory issue, I'd like to use meliae (https://pypi.python.org/pypi/meliae) or some other interesting tools. Anyway, that was some fun sleuthing.
heapy docs
http://guppy-pe.sourceforge.net/heapy-thesis.pdf
http://smira.ru/wp-content/uploads/2011/08/heapy.html
http://haypo-notes.readthedocs.org/en/latest/heap_fragmentation.html
https://chase-seibert.github.io/blog/2013/08/03/diagnosing-memory-leaks-python.html
http://forthescience.org/blog/2014/08/16/python-and-memory-fragmentation/
weak references
https://docs.python.org/2/library/weakref.html
http://pymotw.com/2/weakref/
pylons docs
http://pylons-webframework.readthedocs.org/en/latest/execution.html
http://pythonpaste.org/paste-httpserver-threadpool.html
(This post was constructed from https://github.com/ckan/ckan/issues/1847, https://github.com/ckan/ckan/pull/2262)