
AaronGiles
Re: Final comments on poly.c
09/23/10 03:20 AM


> I still wonder how Aaron Giles came up with it; did he try many
> alternatives and finish with this one? Did he just get it right the "first try"? Is
> he using some lore that is common knowledge to people who often write heavily
> multithreaded code?

It is a mix of a (reasonably) sound architecture and a lot of trial and error. It took many, many hours of tweaking to find the optimal mix of bucket size and other behaviors to achieve maximum throughput on a dual-core system. I'm not too surprised you didn't find a significant speed-up over what exists on a 2P system, though you had some good ideas.

> I find that on every system I have run the poly.c
> renderer on - a single-core Pentium M, a dual-core Pentium D, and a 4 core/8 thread
> Xeon, I get close to 100% CPU utilization of all cores. poly.c seems to scale very
> well.

Careful here, though. The 100% CPU utilization doesn't mean that it's particularly *efficient* at what it's doing. I think it's pretty close to maximally efficient on 2P systems, but on higher-end systems, I am certain that there is significant cache thrashing happening which is slowing things down. At the time I wrote poly.c, I only had a 2P system, but now that I have a 4Px2 system I've been thinking about looking at it again ... eventually.

You can run MAME with the -numprocessors flag to compare performance with different numbers of cores.
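
For instance (hypothetical invocation; adjust the executable and game name for your build), you could run something like

    mame gauntleg -numprocessors 1

and then the same game with -numprocessors 4, and compare the unthrottled frame rates to see how well the work actually scales.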

The other thing to keep in mind is that when a thread is done, it will spin in a tight loop looking for new work for a while before re-waiting, in order to avoid some of the scheduling overhead that comes from repeated signal/wait cycles (i.e., it takes some time between signalling a thread and waking it up, and that was significant enough that spinning in a loop looking for work made a big difference).
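
In sketch form, each worker thread does something like the following (illustrative names only, not the actual poly.c/OSD symbols; the real code uses MAME's work-queue primitives):

#include <stddef.h>

typedef struct work_item work_item;

extern work_item *try_dequeue_work(void);   /* hypothetical: non-blocking dequeue */
extern void wait_for_work_signal(void);     /* hypothetical: blocking wait on the event */
extern void execute_work(work_item *item);

#define SPIN_TRIES 10000

static void worker_loop(void)
{
    for (;;)
    {
        work_item *item = try_dequeue_work();

        /* nothing queued: spin for a while before paying the cost of a
           full sleep/wake cycle on the event */
        if (item == NULL)
        {
            int spins;
            for (spins = 0; spins < SPIN_TRIES && item == NULL; spins++)
                item = try_dequeue_work();

            /* still nothing after spinning, so block until signalled */
            if (item == NULL)
            {
                wait_for_work_signal();
                continue;
            }
        }

        execute_work(item);
    }
}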

> Given that, I thought that maybe I could improve the efficiency by reducing the
> locking, especially the compare-and-swap instructions which are not super cheap in
> terms of their effect on the bus and cache.

Reducing locking is always good, but a bigger challenge is reducing the cache snooping hits. When one thread is done it goes looking for work elsewhere, and scans other work queues to find something, which forces a snoop of cache lines that are currently "owned" by other processors. This in turn boosts the CPU utilization because you're spinning while the transaction is completed, but you're not really accomplishing anything very quickly.
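
Roughly, that scan looks like this (a sketch only; the queue layout is made up and I'm using C11 atomics for illustration where the real code uses MAME's own sync primitives). The compare-and-swap is what drags the owning core's cache line across the bus:

#include <stdatomic.h>

#define NUM_QUEUES 4

typedef struct
{
    atomic_int next;    /* next unclaimed work unit in this queue */
    int        count;   /* total units queued */
} work_queue;

extern work_queue queues[NUM_QUEUES];   /* hypothetical per-thread queues */

/* returns the claimed unit index, or -1 if nothing was found; a real
   version would also report which queue the unit came from */
static int steal_work(int my_index)
{
    int q;
    for (q = 0; q < NUM_QUEUES; q++)
    {
        work_queue *queue = &queues[(my_index + q) % NUM_QUEUES];
        int index = atomic_load(&queue->next);

        /* try to claim one unit; the CAS pulls the cache line holding
           'next' away from whichever core last touched it */
        while (index < queue->count)
        {
            if (atomic_compare_exchange_weak(&queue->next, &index, index + 1))
                return index;
        }
    }
    return -1;
}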

> 1. A thread that generates polygons. This is a single thread (it must be, the
> rendering function of poly.c would blow up if more than one thread tried to render
> polygons at the same time) that is part of the actual emulation of whatever game is
> being emulated, and it, in addition to whatever other CPU emulation and whatnot it is
> doing, at some point during each frame starts spitting out polygons. I don't know if
> it saves them all up to the end or if they are generated interleaved with lots of
> other CPU emulation activity; maybe it is dependent on the system being emulated.

Yes, it is dependent, though if the CPU generates polys along the way, it is better to get them started rendering before the CPU emulation is complete, since you have a whole other processor (or more) just waiting for something to do.

> 2. This is what I focused on when trying to
> optimize; I wanted to find a way to do the scanline generation and bucketizing in a
> different thread so that the main CPU emulation had that much less work to do.

It's not a bad idea, but I think it would really only help on systems with more than 2 CPUs. And as you point out later, it has a problem in that some systems need to know how many pixels are actually going to be rendered in a triangle, for timing purposes.

> Basically one CPU is going to run CPU emulation, plus generate polygons, plus split
> them up into buckets and schedule all of the work procs for those buckets. I thought
> that this might be the long pole so what I tried to do was to take the "scanline
> bucketing" operation off and put it in a separate thread.

The goal of poly.c is also that the thread running the emulation can hop in and pick up some work itself when it is stalled waiting for the rendering to complete. So, depending on the system, you may or may not also be spending time rendering on that thread.

> I also tried splitting the work up into tiles (where the screen is broken up into a
> grid, each square of the grid being handled by a different thread) instead of buckets
> (i.e. horizontal strips), but this was significantly more complicated and I never
> even got it to work correctly before giving up. I don't think it would have helped
> much because the buckets already spread the load around very well.

I think this idea has merit and is something I had been thinking of doing eventually. It does make things more complicated, though. Right now, only tall polygons get split across many buckets, while wide polygons are generally handled by one or two buckets.

Another thing to keep in mind is that a grid would reduce (in theory) contention on overlapping polygons. As the poly count goes up, there are more and more tiny little polygons, and so a much greater chance that each poly would fit in a small grid tile without contending with a small polygon to the left or right.
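
To make the contention difference concrete, a scanline-bucket scheme and a grid scheme assign work roughly like this (the constants and the 1024-pixel screen width are made up for illustration):

#define SCANLINES_PER_BUCKET   8
#define TILE_WIDTH             64
#define TILE_HEIGHT            64
#define TILES_PER_ROW          (1024 / TILE_WIDTH)

/* current scheme: a tall polygon spans many buckets, while two small
   polygons covering the same scanlines always share a bucket no matter
   how far apart they are horizontally */
static int bucket_for_scanline(int y)
{
    return y / SCANLINES_PER_BUCKET;
}

/* grid scheme: a tiny polygon usually lands entirely inside one tile,
   so small polygons to its left and right no longer contend with it */
static int tile_for_pixel(int x, int y)
{
    return (y / TILE_HEIGHT) * TILES_PER_ROW + (x / TILE_WIDTH);
}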

> 2. Another place I tried to optimize was in reducing the number of calls to the
> scanline callback function (that actually does the work of rendering the scanlines)
> by, instead of one call per scanline, making the call once per group of scanlines and
> having the callback itself iterate over the scanlines.

Yes, I tried this as well and it never made much difference, so I kept it simple.

In terms of usage patterns, the key ones are:

* Namco System 22 (e.g., ridgerac) - very light CPU emulation overhead, dominated by fairly simple rendering
* Gaelco 3D (e.g., radikalb) - heavy CPU emulation overhead, lots of fairly simple rendering
* Voodoo (e.g., gauntleg) - heavy CPU emulation overhead, complex rendering pipeline, lots of little polygons

My opinion is that all of these could still see improvements over today's implementation on 4P and greater systems. But it's going to be challenging to make it work well. It's going to be all about cache management: making sure that each thread scans for work in a particular pattern that reduces the amount of time spent contending. Perhaps adding some prefetch instructions to parallelize some of the fetching would also help (this is a simple thing to experiment with), both during processing and during rendering.
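
As a rough sketch of the prefetching idea (not something poly.c does today; the structure layout and render_unit call are hypothetical, and __builtin_prefetch is the GCC/Clang builtin):

typedef struct
{
    void *extents;   /* per-scanline extent data for this unit */
    int   count;
} poly_unit;

extern void render_unit(poly_unit *unit);   /* hypothetical rendering call */

static void process_units(poly_unit *units, int numunits)
{
    int i;
    for (i = 0; i < numunits; i++)
    {
        /* hint the next unit's data into cache while we render this one */
        if (i + 1 < numunits)
            __builtin_prefetch(units[i + 1].extents, 0 /* read */, 1 /* low temporal locality */);

        render_unit(&units[i]);
    }
}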

On top of it all, I think the concepts of poly.c (e.g., aggressively managed work queues with buckets that must be executed in order) should be generalized so that it's not just for rendering. For example, the discrete sound generation code today is nearly capable of splitting up the work into parallelizable chunks, and I have a long-standing promise to couriersud to generalize the poly.c engine to be able to handle that as well.

Aaron

P.S. Sorry I missed your other posts. I don't often have time to write this extensively.






