Saturday, June 11, 2011

(Mostly) Pure Python Physics


Continuing my work on the Pure Python port, I did a small amount of profiling with the outstanding runsnakerun to determine what was bogging it down. As expected, the slow speed of Python's math operations -- and especially the matrix and vector operations -- was the major bottleneck.
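
In case it's useful to anyone else: a cProfile dump is all runsnakerun needs. A minimal sketch (step_world here is just a hypothetical stand-in hot loop, not pypybox2d's actual stepping code):

```python
import cProfile
import io
import pstats

def step_world(iterations=1000):
    # stand-in for stepping a physics world: any math-heavy loop will do
    total = 0.0
    for i in range(iterations):
        x, y = i * 0.1, i * 0.2
        total += (x * x + y * y) ** 0.5  # vector-length style math
    return total

profiler = cProfile.Profile()
profiler.enable()
step_world()
profiler.disable()

# dump stats to a file that runsnakerun can open: `runsnake world.profile`
profiler.dump_stats("world.profile")

# the same data can also be summarized in-process with pstats
stream = io.StringIO()
pstats.Stats("world.profile", stream=stream).sort_stats("cumulative").print_stats(3)
print(stream.getvalue())
```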

Having said that, it would appear there's very little more I can do to speed up the Pure Python port. If you followed the previous post, you'll know that I tried PyPy (it works, after a large number of iterations) and Cython (faster, but still not so great). I considered using some library like NumPy to get its matrix and vector functionality, but including such a massive dependency for this tiny physics library just felt wrong.

I stumbled upon quite a few matrix and vector C extensions for Python, notably Casey Duncan's excellent Planar project. However, each of the ones I found had issues: no support for tuples in place of the vector and matrix types, immutable vectors (though immutability can be a good thing for projects that don't expect to mutate them), and the general headache of having to restructure all of the pypybox2d code that already relies on vector and matrix classes with certain properties and methods. [side-note: were it not for this project, I would definitely be using the Planar library.]

In short, and for better or worse, I re-invented the wheel, creating pure C extensions for the Vec2, Mat22, Transform, and AABB types (they should be compatible with Python 2.5 through 3.2, but let me know your results). Performance improved significantly (sorry, I don't have any benchmarks right now -- but run any of the testbed tests and you'll feel it). Most of the testbed tests now run "fast enough"; exceptions include the pyramid and vertical stack tests, which generate altogether too many contacts and bog down the solver.
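
To give an idea of why the pure Python versions are so slow in the first place: a class along these lines (illustrative only -- not the actual pypybox2d Vec2) pays for a Python-level method call plus a fresh object allocation on every single operation, which is exactly the cost a C extension removes:

```python
# Illustrative pure-Python vector class (hypothetical, not pypybox2d's).
class Vec2(object):
    __slots__ = ("x", "y")

    def __init__(self, x=0.0, y=0.0):
        self.x, self.y = x, y

    def __add__(self, other):
        # every '+' builds a brand-new Python object
        return Vec2(self.x + other.x, self.y + other.y)

    def __mul__(self, scalar):
        return Vec2(self.x * scalar, self.y * scalar)

    def dot(self, other):
        return self.x * other.x + self.y * other.y

a, b = Vec2(1.0, 2.0), Vec2(3.0, 4.0)
print((a + b).x, a.dot(b))  # 4.0 11.0
```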

After adding pickling support for the above classes, I really wanted to give multiprocessing a go. I made a simple test with a set of processes dedicated to shape distance calculations (i.e., consumers). I passed in thousands and thousands of shapes and transforms, and it generated the correct results. Unfortunately, compared to the single-process version, and regardless of the number of input sets, the multi-process version was on the order of 10-50x (!) slower -- not even counting the time it took to start the processes. All of the overhead, presumably from synchronization and from pickling the objects to pass between processes, is simply too expensive. It looks as if this will not be a viable option for a real-time physics engine.
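
Roughly what the experiment looked like (a hedged reconstruction -- the distance function below is a trivial stand-in, not the real shape-distance routine):

```python
import math
import multiprocessing
import time

def distance(pair):
    # trivial stand-in for pypybox2d's shape-distance calculation
    (x1, y1), (x2, y2) = pair
    return math.hypot(x2 - x1, y2 - y1)

def run(pairs, processes=None):
    # computes every distance, either in-process or via a worker pool
    start = time.time()
    if processes:
        pool = multiprocessing.Pool(processes)
        try:
            results = pool.map(distance, pairs)
        finally:
            pool.close()
            pool.join()
    else:
        results = [distance(p) for p in pairs]
    return results, time.time() - start

if __name__ == "__main__":
    pairs = [((i, i), (i + 3.0, i + 4.0)) for i in range(10000)]
    single, t1 = run(pairs)
    multi, t2 = run(pairs, processes=2)
    assert single == multi
    # every argument and result crosses the process boundary via pickle,
    # so the pooled version is typically far slower for work this cheap
    print("single: %.4fs  pool: %.4fs" % (t1, t2))
```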

Now, where to go from here? I'm stumped -- err, I mean -- I'm open to suggestions!

11 comments:

  1. Excellent detective work. I think creating pure C extensions to keep it light and focused was a good choice in this situation. I looked at the C extension code and it's very clear. I like hearing "performance significantly improved"!

    It's interesting to hear your results from using multiprocessing. I wish I had some ideas of where to look next...

    Didn't know about runsnakerun--it *is* fantastic, thanks for mentioning it!

    ReplyDelete
  2. Thanks for checking it out, michael. :)

    Porting more things to C seems to be a logical step, but what exactly should be ported is no longer as clear as it was with the vector (etc.) types. Everything that remains is tied together fairly tightly.

    I still have yet to get a line profiler (as in one that shows how long each line is taking instead of just function calls) working. It might help shed some light on what's happening in the "empty areas" in the runsnakerun graphs.

    ReplyDelete
  3. Just want to write again and say how much *fun* I'm having with PyBox2D! Your examples give me lots of ideas for places to explore, too. Thank you!

    ReplyDelete
  4. It's always nice to hear that. :)

    ReplyDelete
  5. Hi Ken,

    Wow, this looks very impressive.

    I gotta admit I haven't checked your C extensions, but IMHO Numpy would be the best way to go in terms of speed, accuracy, and robustness. Testing and verification are the most difficult parts of creating scientific/math libraries, and Numpy is well tested and verified.
    Documentation is another hassle, and Numpy shines here too.

    If your users are going to use the library for educational purposes then exposing Numpy out of the box would provide an excellent environment for many different use cases.

    I am not sure how you tested Numpy, but if you vectorize the data instead of using loops, Numpy should be as fast as the C++ version of Box2D, if not faster. Years ago I tested Numpy against some C and Fortran libraries and the results were very good. I wouldn't be surprised if it has gotten even faster since then.

    If your primary goal is to keep the library "small" then it may be possible to compile a subset of Numpy (only matrix and vector algebra modules perhaps?).

    As you can see I am a big fan of Numpy and it seems like size is the only downside of using it.

    Fahri

    ReplyDelete
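
A sketch of the vectorization Fahri describes, with toy point data (not pypybox2d's actual shapes): one NumPy expression replaces the per-pair Python loop.

```python
import numpy as np

# toy endpoint data: one row per pair of points
a = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
b = np.array([[3.0, 4.0], [4.0, 5.0], [5.0, 6.0]])

# looped version: one Python-level iteration (and temporaries) per pair
looped = [float(((bx - ax) ** 2 + (by - ay) ** 2) ** 0.5)
          for (ax, ay), (bx, by) in zip(a, b)]

# vectorized version: the whole batch is computed in C inside NumPy
vectorized = np.hypot(*(b - a).T)

print(looped)      # [5.0, 5.0, 5.0]
print(vectorized)  # [5. 5. 5.]
```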
  6. Hi Fahri,

    As you say, it requires vectorization to get the speed benefits of Numpy. That would require a *great* deal of refactoring, which I unfortunately did not (and do not) have the time for. Without vectorization, I highly doubt there is any significant performance difference with my little module.

    As it is, the pure Python port is a (mostly) 1:1 conversion to Python, with some Python niceties as I saw fit. The addition of the C module was primarily for learning purposes -- as it was my first pure C extension -- and secondarily just to see how much faster it could be.

    I'd be very happy to see someone else modify the engine to work with Numpy (or my preferred Planar).

    Myself, however, I'm afraid I've reached the limit of my inspiration for now.

    Ken

    ReplyDelete
  7. You might want to try ShedSkin (http://code.google.com/p/shedskin/); it's pretty fast (even faster than PyPy on several benchmarks).

    ReplyDelete
  8. Here is an idea about parallel computing. Have you considered assigning each "island" to individual processors (or processes)? Since each island is an independent group of elements (either connected by joints or in contact with each other), islands are perfect candidates for parallel computing. Would the bookkeeping, inter-process communication, and assembly/disassembly costs be high enough to offset the savings?

    ReplyDelete
  9. mgo12,

    I wish I could get ShedSkin working with it! If you can, let me know.

    Fahri,

    Several threads on the Box2D forums have mentioned it before. That is definitely something I would like to investigate at some point.

    Also, I've been thinking that I could potentially use multiprocessing's shared lists (or maybe even proxies, somehow?) to side-step serialization of the objects. I'll have to look to see if anyone's already done any benchmarks before I spend my own time.

    ReplyDelete
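
A sketch of the serialization side-step mentioned above: multiprocessing's shared ctypes arrays let workers write results directly into shared memory instead of pickling objects back and forth. (Manager proxies, by contrast, still pickle on every access, so they likely wouldn't help here.) The distance function is again just a stand-in:

```python
import math
import multiprocessing

def fill_distances(xs, ys, out, start, stop):
    # each worker writes its slice of results straight into shared memory
    for i in range(start, stop):
        out[i] = math.hypot(xs[i], ys[i])

if __name__ == "__main__":
    n = 8
    # lock=False gives plain shared ctypes arrays; no synchronization is
    # needed here because each worker touches a disjoint slice
    xs = multiprocessing.Array("d", [3.0] * n, lock=False)
    ys = multiprocessing.Array("d", [4.0] * n, lock=False)
    out = multiprocessing.Array("d", n, lock=False)

    half = n // 2
    workers = [
        multiprocessing.Process(target=fill_distances, args=(xs, ys, out, 0, half)),
        multiprocessing.Process(target=fill_distances, args=(xs, ys, out, half, n)),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

    print(list(out))  # [5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0]
```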
  10. kne,

    Have you considered using Cython as a replacement for SWIG instead of re-implementing the core bits of Box2D?

    I understand why you dislike SWIG. I don't like it because you are kind of stuck with the interface it gives you, and you end up with an ugly mass of interface files. The interface is often less than ideal and cannot properly handle some circumstances.

    Wrapping C++ libraries in Cython is easy. This method is less painful than it seems. Crossing the Python/C++ boundary is pretty straightforward. See a good example use case here: http://www.panda3d.org/blog/?p=173

    Also, I would be interested in what you did as far as adding Cython keywords to help speed up your code in your first attempt. Unless you are willing to heavily mark up your code with Cython types, you are not going to see much more than a 30% improvement. If you are concerned with still running your code in "pure Python" mode, you can use Cython's decorators to make both possible.

    If it were me I would use Cython to write a new wrapper from scratch so I could create a more python-like interface to the C++ objects.

    I really hate to see you duplicating the efforts of Box2D with a performance penalty.

    ReplyDelete
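
A small sketch of the "pure Python" mode Xistic mentions: Cython's decorators carry the type information, so the same file runs under plain CPython (where the decorators are no-ops) or compiles with Cython for the real speedup. The try/except shim is only so the snippet runs on machines without Cython installed:

```python
try:
    import cython
except ImportError:
    class cython(object):  # minimal stand-in so the file stays importable
        double = float

        @staticmethod
        def locals(**_types):
            # no-op decorator, mimicking uncompiled cython.locals
            return lambda func: func

@cython.locals(x=cython.double, y=cython.double)
def length_squared(x, y):
    return x * x + y * y

print(length_squared(3.0, 4.0))  # 25.0
```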
  11. Hi Xistic,

    I fully realized from the start that the pure Python port would be an exercise in futility, at least from the standpoint of having a full-speed port. However, I really wanted to see how it would turn out -- and inspiration like that is not something to ignore. I never really expected it to be a replacement for some sort of wrapped port (although I can't say I didn't dream of that possibility). I also think it's a fairly clear port that people wondering about the internals of Box2D can take a look at.

    I fully agree that moving away from SWIG is the right way to go. Whether it's Cython, SIP, Pyrex, or something else, any of them would be much better suited than SWIG. Were I to do it, I think the interface would be similar to the pure Python port's.

    However, the question is this: is there even a sufficient user base to make the effort of redoing the library meaningful? I don't see many new projects including pybox2d, and those that do seem to use the old versions.

    Perhaps the reason is that, from version to version, there have been complaints that my attempts to improve things break backward compatibility -- which, as a library user myself, I can understand. At the same time, though, this leaves me stuck with a set of poor decisions I made years ago, when I hadn't a clue about releasing libraries.

    If minor backward compatibility issues are enough to ignore the newer versions, I can't imagine what the response would be to a completely rewritten library.

    ReplyDelete