I’m pretty gun-shy about posting about new features on this blog before they are released. One reason is that a fair number of the things I code never make it into the final X-Plane because they just don’t perform as expected. But the flip side of that is: there should be no problem posting about what failed.
One idea that I believe now will not make it into the sim is dual-core pipelined rendering. Let me clarify what I mean by that.
As I have blogged before, object throughput is one of the hardest things to improve in X-Plane. That code has been tuned over and over, and it’s getting to be like squeezing water from a rock. That’s where dual-core pipelined rendering comes in. The idea is pretty simple. Normally, the way X-Plane draws the frame is this:
    for each object:
        if the object is on screen:
            tell the video driver "hey, go draw this OBJ"
Now the decision about whether objects are on screen (culling) is actually heavily optimized with a quadtree, so it’s not that expensive. But still, when we look at the loop, one core is spending all of its time both (1) deciding what is visible and (2) telling the video driver to go draw each object.
So the idea of the pipelined render is to have one core decide what’s on screen and then send that to another core that talks to the video driver. Sort of a bucket-brigade for visible objects. The idea would be that instead of each frame taking the sum of the time to cull and draw, each frame should take whichever one is longer, and that’s it.
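In code, the pipeline would look roughly like the sketch below: one thread culls and pushes visible objects onto a queue, and a second thread pops them and issues the draw calls. This is just an illustration of the idea; the Object type and the cull_object/draw_object functions are hypothetical stand-ins, not X-Plane's actual API.

    // A minimal sketch of a two-core culling/drawing pipeline.
    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    struct Object { bool on_screen; };                                  // stand-in scenery object
    static bool cull_object(const Object* o) { return o->on_screen; }   // stand-in for the quadtree test
    static void draw_object(const Object*)   { /* submit the OBJ to the GL driver here */ }

    void render_frame(const std::vector<Object*>& scene)
    {
        std::queue<const Object*> visible;
        std::mutex                lock;
        std::condition_variable   ready;
        bool                      done = false;

        // Consumer core: pulls visible objects off the queue and talks to the driver.
        std::thread drawer([&]{
            for (;;) {
                std::unique_lock<std::mutex> l(lock);
                ready.wait(l, [&]{ return done || !visible.empty(); });
                if (visible.empty()) break;                       // done and nothing left to draw
                const Object* o = visible.front(); visible.pop();
                l.unlock();
                draw_object(o);                                   // only this thread issues GL calls
            }
        });

        // Producer core: culls and feeds the drawer, bucket-brigade style.
        for (const Object* o : scene)
            if (cull_object(o)) {
                std::lock_guard<std::mutex> g(lock);
                visible.push(o);
                ready.notify_one();
            }

        { std::lock_guard<std::mutex> g(lock); done = true; }
        ready.notify_one();
        drawer.join();
    }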
The problem is: the idea doesn’t actually work very well. First, the math above is too optimistic: a frame actually takes the time of the longer stage plus the waiting time. If you are at the end of a bucket brigade putting out the fire, you waste time waiting until that first bucket comes down the line. In practice, though, the real problem is that on the kinds of machines that are powerful enough to be limited only by object count, the culling phase is really fast. If it takes 1 ms to cull and 19 ms to draw, and we wait for 0.5 ms, the savings of this scheme is only 2.5%.
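Working through those numbers:

    one core (today):  1 ms cull + 19 ms draw               = 20.0 ms per frame
    pipelined:         max(1 ms, 19 ms) + 0.5 ms of waiting = 19.5 ms per frame
    savings:           (20.0 - 19.5) / 20.0                 = 2.5%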
Now 2.5% is better than nothing, but there’s another problem: this scheme assumes that we have two cores with nothing to do but draw. This is true sometimes, but if you have a dual-core machine and you just flew over a DSF boundary, or there are heavy forests, or a lot of complex airports, or you have paged-texture orthophoto scenery, then that second core really isn’t free some of the time, and at least some frames will pick up an extra delay: the delay waiting for the second core to finish the last thing it was doing (e.g. building one taxiway, or one forest stand) and be ready to help render.
And we lose due to one more problem: the actual cost of rendering goes up because of the overhead of having to make it work on two cores. Nothing quite gloms up tight fast inlined code like making it thread-safe.
So in the end I suspect that this idea won’t ever make it into the sim…the combination of little benefit, interference by normal multi-core processing, and slow-down to the code in all cases means it just doesn’t quite perform the way we hoped.
I am still trying to use multiple cores as much as possible. But I believe that the extra cores are better spent preparing scenery than trying to help with that main render. (For example, having more cores to recompute the forest meshes more frequently lowers the total forest load on the first CPU, indirectly improving fps.)
If you run X-Plane 9.21 (or 9.22) on a Macintosh with an old ATI or nVidia graphics card (one with no pixel shaders), you somehow squeeze 25 fps out of X-Plane*, and you are willing to try a test build, please email me.
Those cards include:
- Radeon 7000-9200, inclusive.
- GeForce 2, 3, or 4 series.
I have a change in the panel code that I need to performance test against older hardware!
* Basically you would have to really crank the settings down – but I think under some really baseline settings these machines might be able to run X-Plane 9 without fogging.
It looks to me like we could afford a few landing light halos on most (but not all) hardware. This gets a bit tricky in terms of how we make this available to authors…
- We have to allow access without breaking old planes.
- There will be two distinct cases due to very different hardware.
So…I have posted an RFC on the X-Plane Wiki. Please post your thoughts on the discussion page!
One option (not really discussed in the RFC) is to do nothing at all. Basically I hit upon this during some routine refactoring of the shaders. The whole issue can be deferred indefinitely.
Why wait? Well, I don’t believe that an incremental increase in the number of landing light halos is the future. Our end goal must be some kind of truly global illumination, hopefully without a fixed lighting budget. It may not make sense to add a bunch of complexity to the aircraft SDK only to have all of those limits become unnecessary cruft a short time later.
(I think I can hear the airport designers typing “why do the airplane designers get four lights and we get none? Give us a light or two!” My answer is: because of the fixed budget problem. We can allocate a fixed budget of lights to the user’s aircraft because it is first in line – we know we either have the lights or we don’t. As soon as we start putting global lights in the scenery, we have to deal with the case where we run out of global lights. For scenery I definitely want to wait on a scheme that isn’t insanely resource limited!)
Programmers: yes – DX10 hardware can do a hell of a lot more than 4 global lights. Heck – it can do a hell of a lot, period! For example, it can do deferred rendering, or light pre-pass rendering. A true global lighting solution might not have anything to do with “let’s add more global lights a few at a time.”
Every time I work on a new X-Plane feature, I do a combination of:
- Reorganizing and cleaning up old code.
- Adding new features.
- Tuning performance for this new environment.
My experience has been that the investment in cleaning up old code is more than paid for by faster, easier development of new code – it’s easier to code in a “clean” work area.
As part of my work on 930 I am refactoring and optimizing how we set up pixel shaders. I’m not sure whether there will be any framerate benefits in the short term, but in the long term there is definitely an advantage to being able to set up the optimal shader configuration for any situation.
(Since most of what we draw – OBJs, airplanes, DSFs – can be created by users, we never really know what we’ll be drawing…the set of art content X-Plane can handle is almost unlimited. So it is up to the shader optimization code to “find” the optimal setup for a particular stew of OBJ attributes, textures, etc.)
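To give a feel for what “finding the optimal setup” might look like, here is a minimal sketch of a shader-permutation cache: the draw state is boiled down to a bit mask, and each distinct mask maps to one compiled shader. The flags and names here are hypothetical, made up for illustration; this is not X-Plane’s actual shader code.

    // Hypothetical shader-permutation cache, for illustration only.
    #include <cstdint>
    #include <unordered_map>

    enum ShaderFlags : std::uint32_t {
        kHasAlpha = 1 << 0,     // texture has a meaningful alpha channel
        kIsLit    = 1 << 1,     // geometry wants real lighting
        kTwoSided = 1 << 2,     // two-sided (no-cull) geometry
        kUsesFog  = 1 << 3,
    };

    struct CompiledShader { std::uint32_t mask; };

    static CompiledShader* compile_shader_for(std::uint32_t mask)
    {
        // Real code would generate and compile GLSL for exactly these features.
        return new CompiledShader{ mask };
    }

    class ShaderCache {
    public:
        // Return the shader for this exact combination of state, compiling
        // (and remembering) it the first time we see that combination.
        CompiledShader* get(std::uint32_t mask)
        {
            auto it = cache_.find(mask);
            if (it == cache_.end())
                it = cache_.emplace(mask, compile_shader_for(mask)).first;
            return it->second;
        }
    private:
        std::unordered_map<std::uint32_t, CompiledShader*> cache_;
    };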
The short term fall-out during beta is unfortunately a certain amount of pain. It’s likely that these changes will introduce graphic quirks with certain combinations of planes. These are fixable! The important thing is: if you hit a graphics bug with a particular plane or scenery pack in 930 (whenever we get to beta – we are not in beta yet!) and that bug is not in 921 – report it! It may be that the optimizer is being too aggressive with a particular combination of settings and turning off some critical feature.
I will run the new shader optimizer code through just about every scenery pack and airplane I can find, but invariably there is some magic trick in a third party plane on the .org that I won’t have.
One thought for creating fast content: alpha is expensive! Or rather, let me rephrase that to: if you are not using the alpha channel of your texture, you should not have an alpha channel in your texture.
(For PNG this means stripping the alpha channel off, rather than having a solid 100% opaque alpha channel. For DDS this means using DXT1 with no transparent pixels.)
The new shader optimizer detects the case where alpha is not being used and sets up a more optimal code path. (The old shader optimizer did that too, but only some of the time – in the new code, we will always take this optimization.)
Having alpha blending enabled can inhibit “early-Z” optimizations on modern GPUs, and also require a more expensive blending operation in the framebuffer.* So if your model doesn’t use alpha, strip the channel.
* Some newer graphics cards recognize 100% opaque alpha and provide fast write to the framebuffer. But even if early-Z-type optimizations become alpha friendly, there will still be optimizations we can make in the sim if we hit the no-alpha case.
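As a concrete illustration of “alpha is not being used,” here is the kind of check a loader or optimizer could run over an RGBA texture. The function is a hypothetical example, not X-Plane’s actual loader; the point is simply that a solid 100% opaque channel carries no information.

    // Returns true if every pixel's alpha byte is fully opaque, i.e. the
    // texture could have shipped with no alpha channel at all and the
    // renderer can take the cheaper no-blend path.
    #include <cstddef>
    #include <cstdint>

    bool alpha_is_unused(const std::uint8_t* rgba, std::size_t pixel_count)
    {
        for (std::size_t i = 0; i < pixel_count; ++i)
            if (rgba[i * 4 + 3] != 255)     // alpha byte of pixel i
                return false;               // something is actually translucent
        return true;                        // solid alpha: strip it from the file
    }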
Some coding problems are stubborn – I find myself looking back at a week of work and realizing that all I really did was prove that a bunch of theoretical improvements don’t work in practice.
Improving OBJ throughput is one of those problems. On a high-end machine, even drastic changes to the OBJ engine make only the slightest difference in throughput – 2 or 3% at best. Every improvement counts, but a 3% improvement doesn’t change the game for how we draw scenery.
There is at least one route I haven’t had time to go down yet: object instancing. The theory is that by drawing many copies of the same object with a single draw call, we get a multiplier, e.g. a 2x or 4x or larger amplification of the number of objects we can have.
In practice it won’t be that simple:
- To get such an amplification we have to recognize groups of the exact same object. Grouped objects will have to be culled together. So we might get a hit in performance as we draw more objects that are off-screen, just to use instancing.
- It may be that the grouping requirement is so severe that it is not practical to find and group objects arbitrarily (instead we would have to group objects that are built together, like clusters of runway lights). That might limit the scope of where we can instance.
- The objects have to look more or less the same, so some categories of very complex objects won’t be subject to instancing. (E.g. objects with animation where each object might look different.)
- I have already coded some experiments with geometry shaders, and the results are just dreadful – geometry shaders simply don’t output a huge number of vertices efficiently, so they don’t help us increase our total vertex throughput. The experience has left me with a “prove it” attitude toward GL extensions that are supposed to make things faster.
When will we know whether instancing can help? I don’t know — I suspect that I won’t be able to find time for code experiments for a bit, due to other work, particularly on scenery creation and tools.
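For reference, here is roughly what hardware instancing looks like at the OpenGL level: the object’s mesh is bound once, a small buffer supplies one transform per instance, and a single call draws the whole group. This is a generic sketch (it assumes GLEW for the GL 3 entry points and parks the per-instance matrix in attribute slots 4 through 7), not how X-Plane would necessarily do it.

    // Generic OpenGL instancing sketch (GL 3.3 style).
    // Assumes a VAO with the object's mesh already set up, plus a VBO holding
    // one 4x4 model matrix per instance.
    #include <GL/glew.h>

    void draw_instanced(GLuint vao, GLuint instance_matrix_vbo,
                        GLsizei index_count, GLsizei instance_count)
    {
        glBindVertexArray(vao);

        // A mat4 attribute occupies four consecutive vec4 attribute slots.
        glBindBuffer(GL_ARRAY_BUFFER, instance_matrix_vbo);
        for (int i = 0; i < 4; ++i) {
            glEnableVertexAttribArray(4 + i);
            glVertexAttribPointer(4 + i, 4, GL_FLOAT, GL_FALSE,
                                  sizeof(float) * 16,
                                  (const void*)(sizeof(float) * 4 * i));
            glVertexAttribDivisor(4 + i, 1);   // advance once per instance, not per vertex
        }

        // One call draws every copy; the vertex shader reads its per-instance
        // matrix from attributes 4..7 and positions each copy itself.
        glDrawElementsInstanced(GL_TRIANGLES, index_count,
                                GL_UNSIGNED_INT, nullptr, instance_count);
    }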
A hidden detail of my previous post on variation and terrain textures: variation for flat textures was implemented using more triangles in the DSF in X-Plane 8, but is implemented in a shader in X-Plane 9. This means that you don’t get this feature in X-Plane 9 if shaders are off.
My guess is that this is perfectly acceptable to just about every user.
- If you don’t have shaders, you have something like a GeForce 4 or Radeon 8500, and are fighting for frame-rate. In this case, not paying the price of layer-based variation is a win.
- If you have shaders, you’re getting better performance because the shader creates variation more efficiently than the old layering scheme did.
This kind of move of a feature to the GPU can only happen at major versions when we recut the global scenery, because (to utilize the benefit) the DSFs are recut with fewer (now unneeded) layers. So features aren’t going to mysteriously disappear mid-version.
I do have a goal to move more layering-type features to the GPU for future global scenery renders. There are a number of good reasons:
- DSF file size is limited – we have distribution requirements on the number of DVDs we ship. So DSF file size is better spent on more detailed meshes than on layers.
- GPU power is increasing faster than anything else, so it’s good to put these effects on the GPU – the GPU is still hungry for more!
- If a feature is run on the GPU, we can scale it up or down, or turn it on or off, for more flexible rendering settings on a wide variety of hardware. A feature baked into the DSF is there for everyone, with no way to turn it off.
My hope for the next render is to (somehow) move the cliff algorithm (which is currently done with 2-4 layers) to the GPU, which would shrink DSFs, improve performance, and probably create nicer looking output.
The short answer (to the question of driving a second monitor from a second video card) is: this is not a very good idea.
With OS X, this configuration is supported, and OS X will cleverly copy graphic output from one video card to the other to make the system work well. You will get an fps hit when this happens.
With Vista, this configuration isn’t supported. (Snarky comment: it is lame that Microsoft completely rewrote their video driver infrastructure and went backward in terms of configuration support.)
With Linux, I have no idea if this configuration can run. I do know that trying to change my configuration hosed Ubuntu thoroughly and I decided not to break my Linux boxes any more, having spent plenty of time doing that already in the last few days.
For X-Plane, we can’t handle this case very well (at best you get the framerate hit) because we need to share textures between the IOS screen and main screen. So if you are trying to set up an IOS screen, you really do need a dual-headed graphics card. For what it’s worth, every card I’ve gotten in the last few years has had two video outputs.
My Mac Pro has just gotten weirder – I put a Radeon HD 3870 into the second PCIe x16 slot. (The machine comes with a GeForce 8800.) I now have one monitor in each.
So here’s where things get fun:
- Start X-Plane. 60 fps.
- Drag the window to the second monitor. 30 fps.
- Quit, move the menu bar to the second monitor, restart. (X-Plane is now on the right.) 160 fps.
- Drag the window back to the primary monitor on the left. 100 fps.
What’s going on? Two things:
- On OS X, X-Plane’s graphics are rendered by one video card, and that video card (in 921) is the card that has the menu bar on one of its monitors.
- When an OpenGL window is displayed on a monitor that is not attached to the video card that is doing the rendering, OS X will copy the image from one video card to another, at a cost of some framerate.
So what’s going on above? Well, the 60 fps is my 8800. When I drag the window, the OS starts copying the graphics, slowing fps. When I move the menu bar, the 3870 does the rendering, and we get much higher fps. Once again, if I put the window on the monitor that is not attached to the video card doing the rendering, fps takes a hit.
Final note: fps tests of the 8800 vs 3870 with X-Plane 921:
- Fps test 2, 8800: 46, 49, 51
- Fps test 2, 3870: 70, 75, 80
- Fps test 3, 8800: 24, 25, 25
- Fps test 3, 3870: 40, 41, 43
In other words, the 3870 is significantly faster. I believe that this is due to the OS X drivers, not the cards themselves. Note that the 3870 is in a PCIe 1.0 slot and the 8800 is in a PCIe 2.0 slot.
I think we’ve reached the point where, if you are putting together a new computer and have X-Plane in mind:
- Get a quad-core machine if the pricing is favorable (and I think it should be now).
- Get a “Direct X 10” compatible graphics card. That would be an nVidia 8, or 9 series (or I guess that crazy new 280 card) or a Radeon HD 2000/3000/4000. DX10-type cards can be had for $100 to $150.
Quad core is easy: X-Plane 921 will use as many cores as you have for texture loading (especially in paged scenery), uses two cores all the time, and uses 3 during DSF load. The infrastructure for this additional scalability (previous builds used two cores, more or less) will let us put 3-d generation on 4 cores or more. More on this in another post, but basically X-Plane’s utilization of cores is good and getting better, so four cores is good, particularly if it’s not a lot more expensive.
Now for DX10, first I have to say two things:
- We don’t use DirectX. We have no intention of switching to DirectX, dropping OpenGL support, or dropping OS X/Linux support. I just say “DX10” to indicate a level of hardware functionality (specified by Microsoft). The DX10 cards have to have certain hardware tricks, and those tricks can be accessed both in OpenGL and Direct3D. We will access them by OpenGL.
- We are not going to drop support for non-DX10 cards! (We’re not that crazy.)
X-Plane does not yet utilize those new DX10 features, but the DX10-compatible cards are better cards than the past generations, and are now affordable*. By making sure you get one of these, you’ll be able to use new graphic features when they come out.
* The roll-out of DX10 cards has been similar to DX9: in the first generation there was one expensive but fast card and one cheap but slow card. (With DX10, NVidia got there first; with DX9, ATI did.) Like a few years ago, now that we’re a few revs into the new spec, both vendors are making high quality cards that aren’t too expensive.
Yesterday I described how triangles and meshes can be optimized and hypothesized that building OBJs carefully could improve vertex throughput. Having looked at some numbers today, I think the potential for framerate improvement isn’t that great…an improvement would come from cache utilization (post vertex shader), and our cache usage seems to be pretty good already.
Simulating a FIFO vertex cache with 16 vertices (an average number – very old hardware might have 8 or 12, and newer hardware has at least 24 slots), I found that we miss the cache preventably around 15% of the time (using a random set of OBJs from LOWI to test) – sometimes we really missed badly (20-25%), but a lot of the time the miss rate might be as low as 5%.
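For the curious, simulating a post-transform cache is simple enough to fit in a few lines. The sketch below counts cache misses for a triangle index list against a FIFO of a given size; it is a generic illustration of the technique written for this post, not the actual tool I used.

    // Count post-transform vertex cache misses for a triangle index list,
    // modeling the cache as a simple FIFO of the most recent vertex indices.
    #include <algorithm>
    #include <cstddef>
    #include <deque>
    #include <vector>

    std::size_t count_cache_misses(const std::vector<unsigned>& indices,
                                   std::size_t cache_size = 16)
    {
        std::deque<unsigned> fifo;          // oldest vertex at the front
        std::size_t misses = 0;

        for (unsigned idx : indices) {
            if (std::find(fifo.begin(), fifo.end(), idx) == fifo.end()) {
                ++misses;                   // vertex had to be re-transformed
                fifo.push_back(idx);
                if (fifo.size() > cache_size)
                    fifo.pop_front();       // oldest entry falls out of the cache
            }
            // A hit costs nothing; a FIFO (unlike LRU) does not reorder on a hit.
        }
        return misses;
    }

The miss rate is then count_cache_misses(obj_indices) divided by the total index count.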
What these numbers mean is that at the very best, index optimizations in OBJs to improve vertex throughput might only improve vertex processing by about 15% (with the FPS improvement being less, since vertex throughput isn’t the only thing that slows us down).
In other words, if I solve the cache problem perfectly (which may be impossible) we get at best 15%.
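To see why the fps gain would be smaller than the vertex-processing gain, plug in some purely illustrative numbers (these are made up, not measurements):

    frame time                 = vertex work + everything else
    example                    = 6.0 ms      + 14.0 ms         = 20.0 ms
    with 15% less vertex work  = 5.1 ms      + 14.0 ms         = 19.1 ms
    fps improvement            = 20.0 / 19.1 - 1               = about 4.7%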
So this could be a nice optimization (every 5% win counts, and they matter if you can improve fps by 5% over and over) but cache utilization isn’t going to change the nature of what you can model with an OBJ, because our cache utilization is already pretty good.
Have a Happy Thanksgiving!