Saturday, June 11, 2011
Continuing my work on the Pure Python port, I did a small amount of profiling with the outstanding runsnakerun to determine what was bogging it down. As expected, the slow speed of Python's math operations -- and especially the matrix and vector operations -- were the major bottleneck.
Having said that, it would appear that there's very little I can do for the Pure Python port to speed it up. If you followed the previous post, you'll know that I tried PyPy (it works after a large number of iterations) and Cython (faster, but still not so great). I considered using some library like NumPy to get its matrix and vector functionality, but including such a massive library just for this tiny physics library just felt wrong.
I stumbled upon quite a few matrix and vector C extensions for Python, notably Casey Duncan's excellent Planar project. However, there were always issues with all of the ones I found: not supporting tuples in place of vector and matrix types, immutable vectors (though this can be a good thing for other projects which don't expect it), and just the general headache of requiring restructuring all of the pypybox2d code that already relies on vector and matrix classes with certain properties and functions. [side-note: were it not for this project, I would definitely be using the Planar library.]
In short, and for better or worse, I re-invented the wheel, creating pure C extensions for the Vec2, Mat22, Transform, and AABB types (should be compatible with Python 2.5~3.2, but let me know your results). Performance significantly improved (sorry, I don't have any benchmarks right now -- but run any of the testbed tests and you'll feel it). Most of the testbed tests run "fast enough". Exceptions include the pyramid and vertical stack ones, which have altogether too many contacts, and the solver gets very slow.
After adding pickling support for the above classes, I then really wanted to give multiprocessing a go. I made a simple test with a set of processes dedicated to doing shape distance calculations (i.e., consumers). I passed in thousands and thousands of shapes and transforms, and it properly generated the results. Unfortunately, comparing it to the single process version, regardless of the number of input sets, the multi-process version was on the order of 10-50x (!) slower -- not even including the time it took to start the processes. All of the overhead, presumably from synchronization and pickling the objects to pass between processes is just too expensive. It looks as if this will not be a viable option for a real-time physics engine.
Now, where to go from here? I'm stumped -- err, I mean -- I'm open to suggestions!