I've since been in contact with Object Arts (awesome support, guys!) and have been learning more about their compiler and VM implementation. I've written compilers myself, so this knowledge isn't falling on deaf ears. I'd rather not quote or paraphrase anything that was said, for fear of stating something out of context. But here's a quick summary of my optimizations so far...
Part 1: Points
Point objects are evil! Simply put, do not use them for temporary values unless they are only being used rarely - and I do mean rarely. Definitely not something to use in an #advance: method (called every frame for every actor).
I went through my transform class and added a handful of new methods that would not cause new points to be created. Similarly, I went into each method's code and made sure I wasn't calling methods with temporary points, and decided to inline any code that required this.
Last, I added some destructive, loose methods to the Point class. These were for rotating, normalizing, and getting the squared magnitude without creating a new, temporary Point object. I then went through all the game code and made the changes needed to support these adjustments.
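For anyone wondering what "destructive" means here, this is a rough workspace sketch of the idea, not my actual methods. The variable names are made up for illustration, and it only assumes Point's standard x, y, x: and y: accessors. The point is that the receiver's coordinates get overwritten rather than a new Point being answered.

```smalltalk
| p lenSq len angle c s newX |
p := 3.0 @ 4.0.

"Squared magnitude: no sqrt and no intermediate Point."
lenSq := (p x * p x) + (p y * p y).

"Normalize in place by overwriting the existing coordinates."
len := lenSq sqrt.
len = 0 ifFalse: [p x: p x / len; y: p y / len].

"Rotate in place by an angle in radians. The new x is held in a
 temporary so that the new y is still computed from the old x."
angle := 0.5.
c := angle cos.
s := angle sin.
newX := (p x * c) - (p y * s).
p y: (p x * s) + (p y * c).
p x: newX.
```

Moved into Point as methods, each of these would work on the instance variables directly and answer the receiver.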
These changes alone brought the terrible-case framerate from 150 up to about 300. It's great that a few changes could accomplish so much, but I think it's sad that the Point class is poorly implemented. Hopefully in the future Object Arts can make points a primitive type and really improve their performance.
Part 2: ByteArrays
Still, the framerate was not where I wanted it. So I went back through code I had originally written thinking it would be an obvious optimization, to confirm that it actually was. These areas were typically ByteArrays that were pre-allocated and used for matrix transformations and similar state settings. After all, a single call to glLoadMatrixf() is much faster than 4 calls to glLoadIdentity(), glTranslatef(), glRotatef(), and glScalef(). Right? Sadly, no. It became obvious very quickly that this wasn't the case. I'll just sit back now and let some sample code and timings do the talking....
transform
    "Create a 2D matrix FAST!"
    ^viewTM
        _11: x * scale x;
        _12: y * scale y;
        _21: y * scale x negated;
        _22: x * scale y;
        _41: origin x;
        _42: origin y;
        _43: z;
        yourself

The above code should not take longer to execute than OpenGL calls, but it does. I'm sure OpenGL does some very good assembly level optimizations under the hood for matrix creation and multiplication, but the above code is extremely trivial. Opening up a Workspace and doing some simple timings shows some interesting results:

Time microsecondsToRun: [10000 timesRepeat: [
    GL
        glLoadIdentity;
        glTranslatef: 1.0 y: 1.0 z: 1.0;
        glRotatef: 40 x: 0 y: 0 z: 1;
        glScalef: 1 y: 1 z: 1]].

The above results (on my machine) in a ~2500 microsecond timing. That's pretty damn good for a foreign function interface (FFI). Now, just faking a similar MATRIX setup and slamming it through...

Time microsecondsToRun: [10000 timesRepeat: [
    GL
        glLoadMatrixf: (b
            _11: 2.0; _12: 2.0;
            _21: 2.0; _22: 2.0;
            _41: 2.0; _42: 2.0;
            bytes)]].

This results in ~6500 microseconds. That's more than 2.5 times what's required using just OpenGL! I highly doubt that #glLoadMatrixf: has any significant overhead above any other external method call. But it's possible that there is GP fault checking due to passing a buffer to an external call. Hopefully not.
My assumption at the moment is that there is a good amount of VM overhead for #floatAtOffset:put: in the ByteArray class - most likely due to bounds checking and type coercion. If this is the case, hopefully I can convince Object Arts to add a kind of UnsafeByteArray (or at least some "unsafe" methods to ByteArray).
Given the above information, I changed the code over to just make straight OpenGL calls. I also created a GameColor class instead of using a ByteArray for calls to glColor3fv().
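One way to check that assumption would be to time the buffer writes with no GL call at all. The snippet below is only a sketch: it uses a bare 64-byte ByteArray as a stand-in for the matrix, and the byte offsets assume 16 tightly packed 32-bit floats with _11 at offset 0 and _41 at offset 48, which may not match the real layout. I'm not quoting a number for it here, since it's a sketch rather than something I've timed.

```smalltalk
| buf |
"Stand-in buffer: 16 single-precision floats, 4 bytes each."
buf := ByteArray new: 64.

"Time only the six float stores, with no external call involved."
Time microsecondsToRun: [10000 timesRepeat: [
    buf
        floatAtOffset: 0 put: 2.0;
        floatAtOffset: 4 put: 2.0;
        floatAtOffset: 16 put: 2.0;
        floatAtOffset: 20 put: 2.0;
        floatAtOffset: 48 put: 2.0;
        floatAtOffset: 52 put: 2.0]].
```

If this alone accounts for most of the ~6500 microseconds, the cost is in the stores and not in #glLoadMatrixf: itself.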
These changes netted another significant win and the same particle effect (with about 200 particles) in game is now running at a solid 350 FPS. That's about a 200 FPS gain since my last post. Not too shabby, but still more room for improvement.
Part 3: What's next?
I'm going to continue correspondence with Object Arts regarding these issues. The more I learn about the inner workings of Dolphin the better I'll be able to push it for the benefit of others. Likewise, hopefully I can convince them of the need for certain functionality and it'll be a win-win scenario.
Beyond that, I'm in the process of trying to create a fixed-size pool implementation for particles. Garbage collection of individual particles is still happening far too often given how quickly particles are created and destroyed. Giving each emitter a pool of 200 or so particles that are never released until the emitter dies should yield another decent gain.
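The pool idea, roughly sketched in a workspace. This is not the finished implementation: Object new stands in for a real particle, and the acquire and release blocks are hypothetical names for what would end up as methods on the emitter.

```smalltalk
| capacity free acquire release p |
capacity := 200.

"Allocate every particle up front, once per emitter."
free := OrderedCollection new: capacity.
capacity timesRepeat: [free addLast: Object new].

"Take a pre-allocated particle, or nil when the pool is exhausted."
acquire := [free isEmpty ifTrue: [nil] ifFalse: [free removeLast]].

"Hand a dead particle back instead of dropping the reference."
release := [:particle | free addLast: particle].

p := acquire value.     "free size is now 199"
release value: p.       "free size is back to 200"
```

Since the pool keeps its particles for the emitter's whole lifetime, none of them become garbage until the emitter itself dies. A recycled particle would need its state reset before being reused.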
1 comment:
Curious: what's the timing on just
(b
_11: 2.0; _12: 2.0;
_21: 2.0; _22: 2.0;
_41: 2.0; _42: 2.0;
bytes)
without the GL call?