I have spent almost the entire last week looking at ATI performance on Windows in depth; this post summarizes some of my findings and what we are now working on. Everything in this post applies to ATI hardware on Windows; the ATI Mac driver is a completely separate chunk of code.
Forgive me for quoting an old post, but:
As with all driver-related problems, I must point out: I do not know if it’s our fault or ATI’s, and the party at fault may not be the party to fix it. Apps work around driver bugs and drivers work around illegal app behavior all the time…that’s just how it goes. Philosophically, the OpenGL spec doesn’t require any particular API calls to be fast, so you can’t really call a performance bug a bug at all. It’s ATI’s job to make the card really fast and my job to speculate which subset of OpenGL will cause the card to be its fastest.
This proved to be true – most of the ATI performance problems on Windows involve us leaning heavily on an API that isn’t as fast as we thought it was, but that’s not really a bug, it’s just the particular performance of one driver we run on. The solution is to use another driver path.
Cloud Performance
I’m going to try to keep this post non-technical; if you are an OpenGL nerd you can read more than you ever wanted to know here.
With 100,000 cloud puffs (this is a typical number for one broken layer, looking at an area of thicker clouds) we were seeing a total cost of about 9 ms to draw the clouds if we weren’t GPU bound, compared to about 2 ms on NVidia hardware in the same machine.
Is 7 ms a big delta? Well, that depends on context. For a game developer, 7 ms is a huge number. At 15 fps, saving 7 ms gets you to 16.7 fps, but at 30 fps it takes you up to almost 38 fps. That’s one of the crazy things about framerate – because it is the inverse of how long things take, you get bigger changes when the sim is running faster. For this reason I prefer to think in milliseconds. If we can get a frame out in 20 ms we’re doing really well; if it starts to take more than 50 ms, we’re in real trouble. You can think of 50 ms as a budget, and 7 ms is 14% of the budget – a number you can’t ignore.
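To make the framerate arithmetic concrete, here is a minimal Python sketch of the conversion between fps and frame time. The function names and the 50 ms default budget are just illustrative choices taken from the numbers above; nothing here comes from the sim's code.

```python
def fps_after_saving(fps_before: float, saved_ms: float) -> float:
    """Framerate after shaving saved_ms off every frame."""
    frame_ms = 1000.0 / fps_before      # time per frame at the old framerate
    new_frame_ms = frame_ms - saved_ms  # time per frame after the saving
    if new_frame_ms <= 0:
        raise ValueError("cannot save more time than a frame takes")
    return 1000.0 / new_frame_ms


def budget_share(cost_ms: float, budget_ms: float = 50.0) -> float:
    """Fraction of a per-frame time budget used by one cost."""
    return cost_ms / budget_ms


if __name__ == "__main__":
    for fps in (15.0, 30.0):
        print(f"{fps:.0f} fps -> {fps_after_saving(fps, 7.0):.2f} fps after saving 7 ms")
    print(f"7 ms is {budget_share(7.0):.0%} of a 50 ms budget")
```

The same 7 ms buys under 2 fps at 15 fps but nearly 8 fps at 30 fps, which is why milliseconds are the more honest unit for comparing costs.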
The ATI guys pointed us to a better way to push the cloud data through to the card, and the results are better – about 3 ms for the same test case. That should make things a bit better for real use of the sim, and should get clouds out of the “oh sh-t” category.
Now there is one bit of fine print. Above I said “if we weren’t GPU bound”. I put the sim through some contortions to measure just the cost of the geometry of clouds, because that’s where ATI and NV cards were acting very differently. But for almost anyone, clouds eat a lot of fill rate. That fill rate cost is worse if you crank the rendering settings, run HDR, run HDR + 4xSSAA, have a huge monitor, or have a cheaper, lower compute-power card. So if you are CPU bound, this change will help, but if you don’t have enough GPU power, you’re just going to be blocked on something else.
(A good way to tell if you are fill rate bound: make the window bigger and smaller. If a smaller window is faster, it’s GPU fill rate; if they’re the same speed it’s your CPU or possibly the bus.)
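The window-resize test can be written down as a tiny decision rule. This is only a sketch: the function name, the idea of feeding it two measured frame times, and the 5% tolerance for measurement noise are all my assumptions, not anything the sim provides.

```python
def classify_bottleneck(small_window_ms: float,
                        large_window_ms: float,
                        tolerance: float = 0.05) -> str:
    """Guess the bottleneck from frame times at two window sizes.

    tolerance is an assumed noise margin: frame times within 5% of
    each other are treated as "the same speed".
    """
    if small_window_ms <= 0 or large_window_ms <= 0:
        raise ValueError("frame times must be positive")
    # Fewer pixels only helps if the GPU was busy filling pixels.
    if small_window_ms < large_window_ms * (1.0 - tolerance):
        return "GPU fill rate"
    return "CPU or possibly the bus"


if __name__ == "__main__":
    print(classify_bottleneck(small_window_ms=25.0, large_window_ms=40.0))
    print(classify_bottleneck(small_window_ms=39.5, large_window_ms=40.0))
```

The frame times in the usage lines are made-up inputs for illustration, not measurements.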
At this point I expect to integrate the new cloud code for ATI Windows into the next major patch.
Performance Minus Clouds
I took some comprehensive measurements of framerate in CPU-bound conditions and found that with the “penalty” for the existing clouds subtracted out of the numbers, my machine was about 5% faster with NV hardware than ATI hardware. That may represent some overall difference in driver efficiency, or some other less important hardware path that needs tuning. But the main thing I would say is: 5% isn’t that much – we get bigger changes of performance in routine whole-sim optimization and they don’t affect all hardware in the same way. I have a number of todo items still on my performance list, so overall performance will need to be revisited in the future.
The Cars
The other code path in the sim that’s specifically slower on ATI cards is the cars, and when I looked there, what I found was sloppy code on my part; that sloppy code affects the ATI/Windows case disproportionately, but the code is just slow on pretty much any hardware/OS combination. Propsman also pointed me at a number of boneheaded things going on with the cars, and I am working to fix them all for the next major patch.
So my advice for now is to keep the car settings low; it’s clear that they are very CPU expensive and it’s something I am working on.
Fill Rate
One of the problems with poor CPU performance in a driver is that you never get to see what the actual hardware can do if the driver can’t “get out of the way” CPU-wise, and with clouds having a CPU penalty, it was impossible to see what the Radeon 7970 could really do compared to a GTX 580. Nothing else creates that much fill rate use on my single 1920 x 1200 monitor.*
I was able to synthesize a super-high fill-rate condition by enabling HDR, 4x SSAA, full screen, in the 747 internal view. This setup pushes an astonishing number of pixels (something that I am looking to optimize inside X-Plane). I set the 747 up at KSEA at night so that I was filling a huge amount of screen with a large number of flood lights. This causes the deferred renderer to fill in a ton of pixels.
In this “no cloud killer fill” configuration, I was able to see the 7970 finally pull away from the 580 (a card from a previous generation). The 7970 was able to pull 13.4 fps compared to 10.6 fps, a 26% improvement. Surprisingly, my 6950, which is not a top-end card (it was cheaper than the 6970 that was meant to compete with the 580) was able to pull 10.2 fps – only 4% slower for a significantly lower price.
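For anyone who wants to check the percentages, they are plain relative differences against the 580's framerate. A short Python sketch, with a helper name of my own choosing:

```python
def relative_change(fps: float, baseline_fps: float) -> float:
    """Fractional framerate difference relative to a baseline card."""
    if baseline_fps <= 0:
        raise ValueError("baseline framerate must be positive")
    return (fps - baseline_fps) / baseline_fps


if __name__ == "__main__":
    gtx_580 = 10.6  # fps measured in the fill-rate test above
    for name, fps in (("7970", 13.4), ("6950", 10.2)):
        print(f"{name}: {relative_change(fps, gtx_580):+.1%} vs the 580")
```

Note that these are fps ratios in a GPU-bound test; in a CPU-bound scene the cards would sit much closer together.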
In all cases, this test generated a lot of heat. The exhaust vent on the 7970 felt like a hair dryer and the 580 reached an internal temperature of 89C.
CPU Still Matters
One last thing to note: despite our efforts to push more work to the GPU, it’s still really easy to have X-Plane be CPU limited; the heavy GPU features (large format, more anti-aliasing, HDR) aren’t necessarily that exciting until after you’ve used up a bunch of CPU (cranking autogen, etc). For older CPUs, CPU is still a big factor in X-Plane. One user has an older Phenom CPU; it benches 25-40% slower than the i5 in published tests, and the user’s framerate tests with the 7950 were 30% slower than mine. This wasn’t due to the slightly lower GPU kit; it’s all in the CPU.
The executive summary is something like this:
- We are specifically optimizing the cloud path for ATI/Windows, which should close the biggest performance gap.
- We still have a bunch of performance optimizations to put in that affect all platforms.
- Over time, I expect this to make ATI very competitive, and to allow everyone to run with “more stuff”.
- Even with tuning, you can max out your CPU, so careful tuning of rendering settings really matters, especially with older hardware.
* As X-Plane becomes more fill-rate efficient it has become harder for me to really max out high-end cards. It looks like I may have to simply purchase a bigger monitor to generate the kind of loads that many users routinely fly with.