We’ve posted X-Plane 11.32 release candidate two – it contains very few changes from release candidate one. If you think RC2 changed your framerate from RC1 (for better or worse), it is imaginary.
The big item for RC2 is that X-Plane now uses our own replicating METAR servers. NOAA weather was plagued by 404s when the METARs posted on the server didn’t match the date/time scheme X-Plane expected. Our own server should serve the latest weather we have, whatever that is.
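To make the difference concrete, here is a minimal Python sketch of the two lookup strategies. It is not the actual server code: the in-memory `reports` mapping and both function names are hypothetical stand-ins for whatever the real servers do.

```python
from datetime import datetime, timezone

# Hypothetical store: maps each report's timestamp to its METAR text.

def exact_lookup(reports, expected_time):
    """Strict scheme: fail (like a 404) if the expected timestamp is absent."""
    if expected_time not in reports:
        raise LookupError("404: no METAR for %s" % expected_time.isoformat())
    return reports[expected_time]

def latest_available(reports):
    """Lenient scheme: return the newest report we hold, whatever its time."""
    if not reports:
        raise LookupError("no METAR data at all")
    newest = max(reports)  # datetimes compare chronologically
    return newest, reports[newest]
```

With the strict lookup, a client asking for a timestamp the server never posted gets nothing; with the lenient one, it gets slightly stale weather instead of an error.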
Over the last few weeks we have spent a tremendous amount of developer time investigating reports of instability, crashes and performance problems, and the results have been quite unsatisfying. We really haven’t found a series of smoking guns we could fix to improve stability. We have learned some things about X-Plane’s performance and stability though. The rest of this post gets into the weeds; if you tune out (and I won’t fault you if you do) the TL;DR is: please turn on our anonymous analytics, and click “send” if you get the crash report form. The more data we gather about crashes, the better shot we have at addressing the issues.
Crash Rates and Plugins
X-Plane’s overall crash rate (all causes) is approximately 14% – which is to say, for all of our users using analytics, for every 100 times they launch X-Plane, the sim quits in a way we did not expect or want 14 times. This number has been remarkably stable – it’s not a ton different between 11.26, 11.30, 11.31 or the 11.32 beta.
The sim quitting on purpose because of bad content is not considered a crash. For example, if you load a DSF with a missing .ter file, the sim will refuse to proceed and quit. This failure mode is user hostile in that you can’t fly, but it’s not a crash – the refusal to load is the code working according to design. While I would like to make this code less hostile to users, it’s also worth noting that these cases are ones where the author of the scenery pack would have been able to fix this if they had loaded their own work even just once. That is, they are caused by an add-on that should not have shipped.
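As a rough illustration of how such a rate could be computed, the Python sketch below counts launches by exit reason and treats deliberate quits on bad content as non-crashes. The exit-reason names are hypothetical; the post does not describe the real analytics schema.

```python
from collections import Counter

# Hypothetical exit categories. Deliberate quits on bad content are
# intentionally NOT in this set, so they do not count as crashes.
CRASH_EXITS = {"sim_crash", "plugin_crash"}

def crash_rate(exit_reasons):
    """Fraction of launches that ended in an unexpected quit."""
    counts = Counter(exit_reasons)
    launches = sum(counts.values())
    if launches == 0:
        return 0.0
    crashes = sum(counts[reason] for reason in CRASH_EXITS)
    return crashes / launches
```

For example, 100 launches of which 14 ended in `sim_crash` or `plugin_crash` give a rate of 0.14, no matter how many of the remaining 86 were refusals to load broken scenery.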
(There is an in-between area where an add-on is mis-installed because it has a library dependency that the user hasn’t met. This is a deployment problem that really needs to be solved, but it’s orthogonal to true app crashes.)
We categorize crashes into plugin and non-plugin crashes; starting with 11.32 we actually get a statistical picture of this. A crash is a plugin crash (and you see the “we crashed because of a plugin: XSquawkBox” or whatever) if the sim crashed while executing code on behalf of a plugin or inside the plugin on the main thread.
There are a bunch of cases where plugins do not get correctly tagged – in particular, we regularly see crashes on random worker threads spawned by plugins; since we don’t know whose thread it is, we can’t blame the plugin. For example, the FF A320 crashing inside CEF on a worker thread is registered as an X-Plane crash (and we see it in our auto-reporting view) but it’s not our code and there’s nothing we can do about it.
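The attribution rule described above can be sketched in a few lines of Python. This is only a model of the logic as stated in the post, not the sim's crash handler; `CrashContext` and its fields are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CrashContext:
    on_main_thread: bool
    # Plugin whose code (or callback) was executing when the crash hit,
    # if the sim knows it. Hypothetical field for illustration.
    active_plugin: Optional[str]

def blame(ctx):
    """Blame a plugin only when we know whose code was running."""
    if ctx.on_main_thread and ctx.active_plugin is not None:
        return "plugin: " + ctx.active_plugin
    # Worker threads are anonymous here, so the crash lands on the sim
    # even if a plugin actually spawned that thread.
    return "X-Plane"
```

Under this rule a crash on a plugin-spawned worker thread returns `"X-Plane"`, which is exactly the mis-tagging case described above.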
One thing I’d like to do in future patches is improve diagnostics. The rate of actual blamed plugin crashes appears so far to be quite low, and given that we do see uncaught plugin crashes in our data on a regular basis, I think this is a case where add-on authors can only fix what they can see. If we can attribute all plugin crashes to plugins, then the plugin authors can catch their own bugs. Better diagnostics also helps a user remove a troublesome add-on in the case where that would help.
We have a few cases where X-Plane hits an error condition and deals with it by crashing. This is pretty bad, drives up our crash rate, and is something we need to fix to be less user hostile. For example, if a PNG file is bad (either corrupt contents or the sector on disk that backs it has gone bad) then X-Plane’s response is usually to mysteriously crash. Besides being rude, the crash gives an end user no idea which file is bad, and thus no way to fix it. X-Plane ships with something like 9000+ PNG files and over 2500 DDS files, not counting add-ons, so if we don’t tell you which file is bad, you’re not going to find it by poking around.
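A friendlier failure mode would name the offending file. The Python sketch below walks a folder and reports any `.png` whose first eight bytes are not the standard PNG signature. It is an assumption-laden illustration: it only checks the signature (a real check would decode the whole image), and the function names are made up for this example.

```python
import os

# The fixed 8-byte signature that starts every valid PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def check_png(path):
    """Return None if the file starts like a PNG, else a message naming it."""
    try:
        with open(path, "rb") as f:
            header = f.read(len(PNG_SIGNATURE))
    except OSError as e:
        # Covers unreadable files, e.g. a bad sector on disk.
        return "%s: cannot read (%s)" % (path, e)
    if header != PNG_SIGNATURE:
        return "%s: not a valid PNG (bad signature)" % path
    return None

def find_bad_pngs(root):
    """Collect one message per suspicious .png file under root."""
    problems = []
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            if name.lower().endswith(".png"):
                message = check_png(os.path.join(dirpath, name))
                if message:
                    problems.append(message)
    return problems
```

The point is the shape of the error, not the check itself: each message carries the path, so a user has something to act on.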
FMOD sound bank incompatibilities are another example – if you have two aircraft sharing a byte-wise copy of the same FMOD data (e.g. command-D duplicate the C172) loaded at once, then our FMOD loading code fails and that registers as a crash. The code is working exactly as we designed it, but the design isn’t robust enough. As we’ve learned, in the real world users duplicate aircraft (and their FMOD packs) on disk all the time.
The crashes in this category are things where we’re being user-unfriendly and better code would make these problems go away or leave users with a way to actually fix them. But they’re not a case of “weird stuff inside the sim blew up.”
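One way to spot this condition before it becomes a crash is to look for byte-identical files. The Python sketch below groups files by SHA-256 digest. It says nothing about how X-Plane or FMOD actually load banks; the function names and the idea of checking up front are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict

def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file's contents, read in chunks to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicate_banks(paths):
    """Return groups of paths whose contents are byte-for-byte identical."""
    by_digest = defaultdict(list)
    for path in paths:
        by_digest[file_digest(path)].append(path)
    return [sorted(group) for group in by_digest.values() if len(group) > 1]
```

A loader that knew about such a group could reuse the already-loaded copy or at least name the two aircraft involved, rather than failing.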
Persistent Stability Problems
In the crash data we do also see a few persistent stability problems. There’s some kind of crash in the ATC system that we’ve seen for a very long time but we don’t know how to reproduce. It’s a case where if enough users do enough random stuff, we hit an edge case in the ATC system that isn’t handled correctly. The solution here is to embed more diagnostics at the crash site until we can understand it from the reports we get from users. Please hit “send” when you crash – don’t worry about filling in the fields – it’s the report itself that we need.
We also see a lot of crashes inside the OpenGL drivers, from all of NVidia, AMD and Intel. Because the IHVs (the independent hardware vendors who write the drivers) don’t share symbols and source code with us, we really can’t tell what went wrong in these cases.
My hope is that with Vulkan we’ll have better options for in-driver crashes. With Vulkan, they redesigned the error checking model: error checking is a feature you enable (via a configuration option at app startup) that brings a layer of code in on top of the driver to check what the app is doing. With error checking off we get the fastest framerate, and with error checking on we get slow framerate (more error checking means more slow) but some really great diagnostics.
(To put this in perspective: when Sidney ran the Vulkan version of Airfoil-Maker on Linux with the wrong driver installed and no error checking, it rebooted his entire window server! So no error checking really means: no error checking.)
Since error checking is optional and selectable when the app runs, we could put an option into X-Plane to run in “safe mode” – if a user is hitting persistent stability problems in the driver, that user could turn on validation and possibly capture an error in X-Plane itself that would otherwise just be “the driver crashed.”
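A “safe mode” switch could boil down to deciding which layers to request when the Vulkan instance is created. The Python sketch below models only that decision. `VK_LAYER_KHRONOS_validation` is the name of the Khronos validation layer; the `safe_mode` flag and the function are hypothetical, and in a real app the list of available layers would come from `vkEnumerateInstanceLayerProperties`.

```python
# Name of the standard Khronos validation layer.
VALIDATION_LAYER = "VK_LAYER_KHRONOS_validation"

def choose_instance_layers(safe_mode, available_layers):
    """Pick the layers to enable at instance creation.

    safe_mode        -- hypothetical user setting, off by default
    available_layers -- layer names reported as installed on this machine
    """
    if not safe_mode:
        # Normal runs: no layers, no checking, fastest framerate.
        return []
    if VALIDATION_LAYER not in available_layers:
        # The layer may simply not be installed; run unchecked
        # rather than refuse to start.
        return []
    return [VALIDATION_LAYER]
```

The returned list is what would be passed as the enabled layer names when creating the instance, so the choice costs nothing at all when safe mode is off.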
Running Out of Memory
I’ve worked with a few users to try to track down the out of memory problems we’ve heard about, and there isn’t an obvious pattern here. Some users report running out of memory in 11.30, but when put back on 11.26, find that they still run out of memory. We get more out-of-GPU-memory complaints on 11.30, but that might be because a few cases in 11.26 that were crashes due to running out of memory now report the problem in an orderly way. In 11.26 they were just mysterious crashes with a “send” form.
In the cases we’ve seen, the user running out of memory was often…actually running out of memory – that is, the surprising thing is not the crash but that X-Plane ever worked at all on those settings. The fundamental problem we face is that we have no visibility into what the OpenGL driver is doing with GPU memory. The OpenGL driver tries to manage memory no matter how much we ask for, and if it fails, we don’t know what went wrong.
The good news is: we have much better options for Vulkan. With Vulkan, we manage memory, which means we know what’s going on, and we can take steps to avoid out of memory crashes. If we do run out of memory, it should be for much more obvious reasons. We’re still analyzing what we can do about memory with Metal, but the choices should still be better than OpenGL.
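To show what “we manage memory” buys in terms of visibility, here is a toy Python bookkeeping class. It is not X-Plane’s allocator and makes no Vulkan calls; the class, its tags and the byte budget are all hypothetical. The point is only that an app that does its own accounting can refuse a request cleanly and say who is using what.

```python
class GpuBudget:
    """Toy accounting of GPU memory that the app itself hands out."""

    def __init__(self, budget_bytes):
        self.budget = budget_bytes
        self.used = 0
        self.by_tag = {}  # e.g. "textures" -> bytes currently reserved

    def try_reserve(self, tag, size):
        """Reserve size bytes for tag; return False instead of crashing."""
        if self.used + size > self.budget:
            return False
        self.used += size
        self.by_tag[tag] = self.by_tag.get(tag, 0) + size
        return True

    def release(self, tag, size):
        """Give back size bytes previously reserved under tag."""
        self.by_tag[tag] = self.by_tag.get(tag, 0) - size
        self.used -= size

    def report(self):
        """Largest consumers first, for an orderly out-of-memory message."""
        return sorted(self.by_tag.items(), key=lambda kv: kv[1], reverse=True)
```

When `try_reserve` returns `False`, the caller can lower a setting or show `report()` to the user, which is the “much more obvious reasons” case rather than a mystery crash.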
The only advice I can offer now if you are seeing persistent memory crashes is: turn your settings down or use fewer add-ons. If you push X-Plane to the limits of your hardware, it may work for a while and then fail.
Performance
Sidney has looked at a lot of performance data from users who reported low framerate, and in almost every case, the performance has been as-expected. The most common case we see is users with relatively low single-core CPU performance hitting low framerate at high rendering settings while their GPU is bored. To put some numbers on this, if your CPU’s single-core geekbench score is down around 2000, you are almost certainly CPU bound, way at the bottom of what’s okay for X-Plane, and a new GPU won’t help.
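The “CPU bound while the GPU is bored” diagnosis comes down to comparing how long each side takes per frame. The Python sketch below is a simplified model, assuming you already have per-frame CPU and GPU times in milliseconds from some profiler; the function name and the 10% tolerance are arbitrary choices for illustration.

```python
def classify_bottleneck(cpu_ms, gpu_ms, tolerance=0.10):
    """Say which side limits framerate, given per-frame times in ms.

    The slower side sets the frame time. If the two are within
    `tolerance` (as a fraction of the slower one), call it balanced.
    """
    slower = max(cpu_ms, gpu_ms)
    if slower <= 0:
        raise ValueError("frame times must be positive")
    if abs(cpu_ms - gpu_ms) <= tolerance * slower:
        return "balanced"
    return "CPU bound" if cpu_ms > gpu_ms else "GPU bound"
```

A frame that needs 40 ms of CPU work and 12 ms of GPU work is CPU bound: a faster GPU would shrink the 12 ms, but the frame would still take 40 ms.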
As of now, X-Plane cannot use large numbers of cores (e.g. a 32-core machine is useless) and gets only limited performance boosts for framerate with multiple cores under some circumstances. This is something we are working to change in the future, but it’s not going to change quickly. If you are looking to improve performance with hardware, single core speed is still the most important metric.*
In a few rare cases we saw performance that was disproportionately bad compared to what we’d expect from an old CPU. We’re trying to gather more data from these users but the case is rare enough that we haven’t gotten a useful report yet. If we find a smoking gun, we can act on it.
Like memory, Vulkan will help with diagnosing performance. With Vulkan, more of the code is written by us and the Vulkan code we run has very predictable timing. So when we get complaints about performance, we’ll be in a much better position to understand what is slow and why.
I’m actually not sure what the next patch will be, but we do have a bunch of bug fixes to 11.30 waiting to go out once we have stabilization under control. I also have a pile of bugs that I have not yet fixed that are high on my priority list where something in 11.26 stopped working in 11.30. So if you have filed a bug that’s not fixed, we have not forgotten about it – it’s either next to come out or possibly on the short list.
* Yes, we realize that this dependence on single core speed is bad. It’s just going to take time to move to Vulkan and then offload the single thread.