MAMEWorld Forums - EmuChat - Final comments on poly.c

Please note this is likely only interesting to developers, and probably not even interesting to them, so don't bother reading further if you're not a dev; I don't mean to waste your time; thanks ...

I'm only writing this because I spent 60+ hours trying to improve the performance of the polygon rendering in poly.c and I was not able to make a single positive improvement; so with nothing to show for the time spent I thought I'd at least post this to save anyone else trouble in trying to optimize this code.

First off, the algorithm as it is written is very good. It's fiendishly simple and yet not obvious (at least not to me) that it's the best way until you start trying alternatives. I still wonder how Aaron Giles came up with it; did he try many alternatives and finish with this one? Did he just get it right the "first try"? Is he using some lore that is common knowledge to people who often write heavily multithreaded code?

First off - my presumption that there are inefficiencies in the threading with poly.c were wrong, I think. When I isolated that code into a small benchmarking program that runs nothing else other than random polygon renders (no CPU emulation, just throwing a bunch of random triangles up on the screen - and not even drawing anything, just generating the polygon scanlines and then running some fake 'work' over them to simulate drawing them on a screen), I find that on every system I have run the poly.c renderer on - a single-core Pentium M, a dual-core Pentium D, and a 4 core/8 thread Xeon, I get close to 100% CPU utilization of all cores. poly.c seems to scale very well.

Given that, I thought that maybe I could improve the efficiency by reducing the locking, especially the compare-and-swap instructions which are not super cheap in terms of their effect on the bus and cache.

Let me describe the way that I understand the poly.c implementation. There are three major pieces:

1. A thread that generates polygons. This is a single thread (it must be, the rendering function of poly.c would blow up if more than one thread tried to render polygons at the same time) that is part of the actual emulation of whatever game is being emulated, and it, in addition to whatever other CPU emulation and whatnot it is doing, at some point during each frame starts spitting out polygons. I don't know if it saves them all up to the end or if they are generated interleaved with lots of other CPU emulation activity; maybe it is dependent on the system being emulated. I don't know - perhaps I should have investigated to find out, because maybe it would have affected the optimizations I tried. But at any rate, my benchmark program allows me to simulate any amount of CPU work I want to in between spitting out polygons and I couldn't make my version of poly.c faster than the original poly.c no matter how much work was simulated. But I digress.

2. The same thread that generates polygons runs through a routine that generates spans of scanlines (i.e. horizontal lines comprising the rendered polygon) and puts each span in a 'bucket'. Buckets are defined as being 8 scanlines high; imagine your screen being split into horizontal strips, each one 8 lines high, and you have a good idea what the buckets are. As the scanlines for each polygon are generated, they are added to whatever bucket they fall into; so if a polygon is very tall it might cover 10 buckets (i.e. 80 pixels) and its scanlines will be put into groups of 8 into each appropriate bucket. Keep in mind that all of this happens in the same thread that generates the polygons so if the CPU emulation is the slow part of the game, then all of this will only add to the slowness. This is what I focused on when trying to optimize; I wanted to find a way to do the scanline generation and bucketizing in a different thread so that the main CPU emulation had that much less work to do.

3. Once the scanlines for a polygon are generated and put onto the *end* of the appropriate buckets (note that they have to go on the end, so as to be rendered *after* previous polygons), 'work procs' are scheduled for the scanline groups thus added to each bucket, one 'work proc' per scanline group. This means that whatever threads that are available will immediately wake up and start processing these scanline groups, one per group. At this point, the screen is now handled as if it were N screens, where N is the number of buckets that the screen has been split up into. They are all completely independent; polygons that span buckets will have their bucket segments handled in parallel. This is where the parallism of poly.c comes into play. It treats the screen as N horizontal strips instead of one single big 'strip', and each strip can be handled completely independently of the others. This allows multiple threads to work on the screen simultaneously.

4. As each thread wakes up and starts processing the scanline group, there is a potential ordering issue: we can't guarantee what order the buckets will be processed in (multiple threads all running at once can be scheduled in an arbitrary order). But we can't let a scanline group from a polygon that came 'after' a scanline group in the same bucket from a prevous polygon be rendered before the other one, so the work proc for scanline groups uses some tricky compare-and-exchange instructions to ensure that whatever scanline group came just before this one in the bucket that it's in has already been completely rendered, and if it has not, adds the current scanline group to the end of the list of scanline groups that the previous scanline group is on. So in effect, either we ('we' being the thread that is handling a scanline group) render the scanlines, or if we can't because there was a polygon that came before us that hasn't been rendered yet, we add our scanlines to those so that whatever thread is about to them will do both theirs and ours. This is a very elegant algorithm - "I'll do the work if I can, but if I can't, I'll give the work to someone who can" - and it is effective at keeping all CPUs busy all of the time.

OK so there are a few things that I tried to focus on to improve the efficiency:

1. The major thing is to try to identify the 'long pole' on every frame and try to split that long pole up so that the amount of work each processor is doing is more evenly balanced. Basically, a frame can only be finished when its long pole is finished, so if you have a pole significantly longer than the others, you want to split some of the work off of it and distribute that among the other processors, shortening the long pole and lengthening all of the others. The ideal is for every processor do to the same amount of work each frame, although that's not exactly possible we can get pretty close.

Basically one CPU is going to run CPU emulation, plus generate polygons, plus split them up into buckets and schedule all of the work procs for those buckets. I thought that this might be the long pole so what I tried to do was to take the "scanline bucketing" operation off and put it in a separate thread. In the end what I found was that the thread scheduling overhead was costlier than the amount of extra parallelism achieved. The amount of work that goes into generating and bucketizing the scanlines is quite small compared to actually rendering the resulting scanlines, and small compared to the threading overhead thus introduced.

I also tried splitting the work up into tiles (where the screen is broken up into a grid, each square of the grid being handled by a different thread) instead of buckets (i.e. horizontal strips), but this was significantly more complicated and I never even got it to work correctly before giving up. I don't think it would have helped much because the buckets already spread the load around very well.

2. Another place I tried to optimize was in reducing the number of calls to the scanline callback function (that actually does the work of rendering the scanlines) by, instad of one call per scanline, making the call once per group of scanlines and having the callback itself iterate over the scanlines. This also had no significant positive effect - the cost of making 8 function calls (through a function pointer) is absolutely lost in the noise of actually doing all of the work on each pixel of the scanline within the callback.

3. Finally, I tried a varienty of small optimizations in the scanline generating function (re-using variables where possible, optimizing out some tests, etc). This also had negligeable effect because once again this is all overshadowed by the work done per-pixel in the scanline rendering function.

OK so I officially admit to being beaten. I *so* wanted to make some kind of awesome improvement to the polygon rendering code to make it faster, but the code as it is is already just about as fast as it's going to get.

Of course, someone smarter than I am is probably going to come along and prove me wrong by making the polygon rendering code faster ...

EDIT

Oh - and I forgot one more complication. There is actually one video system that *requires* that the scanline bucketing all happen in the main CPU thread and not farmed off to other threads - the 'voodoo' video system. For reasons unknown to me (good reasons I am sure, but that code is all greek to me) the main CPU thread that generates polygons has to get a count back of the number of pixels that each polygon occupies, and that means calculating that in the main CPU thread, which means doing the scanline conversion and bucketing in the main thread. Even if I had managed to make doing the scanline bucketing in a separate thread faster than doing it in the main thread, I would have had to have a 'special' implementation like the existing one just to use for voodoo.c. That was a pretty unsavory proposition.

Edited by Bryan Ischo (09/23/10 02:12 AM)