One of the most well known errors you will encounter in X-Plane is the infamous device loss error. But it’s also one of the least well understood errors and one of the hardest to debug, so in this blog post I want to talk about what device losses are, what we have done to fix them and what you can do to help us investigate these issues. Let’s actually start with the latter because if there is one takeaway from this blog post for anyone, it’s the following:
In case of device loss
If you encounter a reliable device loss error, please run X-Plane with --aftermath in the command line options, or run X-Plane via the X-Plane_aftermath.bat script from the Support folder inside of your root X-Plane folder. This flag currently works with Nvidia, AMD and Intel hardware, but it will come with a bit of a performance hit. It’s best used in cases where you already know you are going to run into a device loss and want to provide additional diagnostics. I would not recommend running with this flag all the time, in particular because it does not work around device losses in any way; it just gathers additional data to make device loss debugging easier. When running with Aftermath, next time you encounter a device loss, X-Plane will no longer call it that but instead say “Encountered a GPU crash!”. Please submit those crash reports! If you see this message it means that X-Plane was able to gather diagnostic data that can help identify what happened.
What is a device loss anyways?
Everyone knows that applications can crash. The human species is incapable of writing bug free code and some of these bugs lead to crashes, while others just result in broken logic ranging from mildly entertaining to downright annoyingly broken. But crashes are definitely the worst kind of bug: If you have invested 7 hours into a flight and then suddenly X-Plane crashes, it’s probably one of the worst experiences in flight simming. I know it has put me off from virtual flying for weeks in the past. Luckily, when applications crash, it’s possible to capture the current state of execution and everything that led up to it. As programmers, we get to see the callstack, which tells us not only which function we are currently in and where, but also what function has called it and in turn which has called that function and so on. We can also see what thread executed it, what the other threads were up to and even the state of the CPU’s internal registers that hold part of the current program state. While this isn’t always enough to create a fix, it points in a very specific direction and lets us investigate the code and, together with the log, piece together what happened. Essentially, when you submit a crash report, you send us the black box of your flight. It contains forensic information that can be analyzed to figure out what happened, but we don’t look as cool as NTSB investigators.
Device losses in reality are just crashes as well, but crashes on the GPU. GPUs these days, and really for a long time now, are almost fully programmable and just execute arbitrary code. Programs that execute on a GPU are called shaders, a leftover from the olden days when they could exclusively be used to shade vertices. But these days, shaders are just tiny programs that execute on potentially tens of thousands of GPU cores. Shaders are responsible for basically everything that X-Plane puts up on the screen, from mundane tasks like transforming vertex data into screen space and calculating the colour of every single pixel all the way to culling tens of thousands of trees per frame. X-Plane makes use of a lot of shaders because GPUs are incredibly flexible and a lot of workloads lend themselves really well to their highly parallel nature. Counting shaders is surprisingly hard because X-Plane bundles shaders into what we call “modules”, which contain variants of similar shaders. But doing a count of just modules, X-Plane 11 shipped with 29, while X-Plane 12.1.3 had 88 modules and 12.2b4 ships with 91. Each shader module contains anywhere from a handful to thousands of distinct shader variants, so the real runtime count of compiled shader programs is much higher, somewhere in the mid to high 5 digit range. Quantity is of course not a measure of quality, but X-Plane 12 as well as any modern game wouldn’t be possible without the ability to flexibly run millions of shader invocations every single frame.
Just fix the crashes then
While shaders are no doubt one of the greatest additions to computer graphics, they come at a cost: They are incredibly hard, near impossible, to debug properly. On the CPU side you can attach debuggers to your code and then step through the code as it executes while seeing the result of each operation. On the GPU that is not a possibility: you can’t inspect the state of a running shader, only the results of the execution after the fact. And if nothing shows up on screen, then good luck trying to figure out where your triangles went. And if it crashes, you are even more toast because now you only have broken pieces left to look at. There are tools to make this a bit easier and it definitely isn’t as bad as it used to be, which is a big reason why it’s possible to put a lot of shaders into the field now: The tooling around this has gotten a lot better. But for crashes, the GPU once again turns into a terrible black box.
The first hurdle is that the CPU and GPU run asynchronously from each other. The CPU encodes operations for the GPU into a command buffer that is then submitted to the GPU for execution, so right there you already have a source of latency. In practice, the CPU is usually at least a frame ahead of the GPU in terms of what it is computing. Detecting a crash also involves latency: often the OS/driver has to recover the GPU from its current state and put everything back together. Since the GPU is responsible for getting anything onto the screen, the OS and driver are much more interested in getting you back into a state where you can interact with your computer than in reporting the failure quickly. But eventually, X-Plane will catch up to the fact that the GPU crashed, because at some point a call into Vulkan will produce the infamous error code VK_ERROR_DEVICE_LOST. This is why it will often feel like your computer is glitching out prior to a device loss: everything stops working for a second, the OS and driver recover the GPU, your windows and displays might flicker around for a second and only then will X-Plane go “yeah by the way, stuff’s broken, yo”.
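To make that latency a bit more tangible, here is a minimal sketch of a CPU that runs a few frames ahead of a simulated GPU. Everything in it is hypothetical and made up for illustration (`SimulatedGpu`, `Result`, `simulate_device_loss`); it does not use the real Vulkan API and it is not how X-Plane is structured. The only point is that submitting the faulty frame succeeds, and the error only surfaces on a later, unrelated submission.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <optional>

// Hypothetical stand-in for a Vulkan result code; not the real API.
enum class Result { Success, DeviceLost };

struct SimulatedGpu {
    std::uint64_t bad_frame;                     // frame whose work will crash the GPU
    std::deque<std::uint64_t> pending;           // submitted but not yet executed frames
    std::optional<std::uint64_t> faulted_frame;  // set once the GPU has crashed

    // CPU side: submitting only queues work, so the bad frame itself submits fine.
    Result submit(std::uint64_t frame) {
        if (faulted_frame) return Result::DeviceLost;
        pending.push_back(frame);
        return Result::Success;
    }

    // GPU side: execute the oldest queued frame, crashing on the bad one.
    void execute_one() {
        if (faulted_frame || pending.empty()) return;
        const std::uint64_t frame = pending.front();
        pending.pop_front();
        if (frame == bad_frame) faulted_frame = frame;
    }
};

struct LossReport {
    std::uint64_t faulted_frame;      // where the problem actually was
    std::uint64_t detected_at_frame;  // where the CPU finally noticed
};

// Keeps the CPU `frames_ahead` frames ahead of the GPU and reports when the
// loss is finally observed, or nothing if no loss occurs within max_frames.
std::optional<LossReport> simulate_device_loss(std::uint64_t bad_frame,
                                               std::size_t frames_ahead,
                                               std::uint64_t max_frames) {
    assert(frames_ahead >= 1);
    SimulatedGpu gpu{bad_frame, {}, std::nullopt};
    for (std::uint64_t frame = 0; frame < max_frames; ++frame) {
        if (gpu.submit(frame) == Result::DeviceLost) {
            return LossReport{*gpu.faulted_frame, frame};
        }
        // The GPU only gets to run once the CPU has built up its lead.
        if (gpu.pending.size() > frames_ahead) gpu.execute_one();
    }
    return std::nullopt;
}
```

In this toy model, a fault in frame 3 with the CPU two frames ahead is only reported while submitting frame 6. A real driver adds recovery time on top of that, so the gap in practice is larger and much less predictable.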
This huge latency between cause and effect is one of the big issues with device loss debugging. By the time the CPU side of X-Plane realizes the GPU is dead, it’s too late to gather data anymore: Something we did god knows how long ago crashed and that’s all the information we have. This is also why the log file tends to just not be very useful for device loss triage. This is especially bad when the device loss is hard to reproduce and happens only once in a blue moon. My favourite kind of device loss is the one where you can get X-Plane to crash really fast and reliably; in those cases it’s possible to start toggling things on and off and compare what happens to zero in on the actual cause. I have spent many a day having X-Plane take down my window manager over and over again.
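For the reliable kind, the toggling can be made systematic. The sketch below is a plain bisection over a list of feature toggles, which is a swapped-in technique for illustration rather than a description of how device losses are actually triaged at Laminar. It assumes exactly one toggle triggers the crash and that the crash reproduces every single time; `find_culprit` and the `crashes_with` callback are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Returns the index of the single feature that triggers the crash.
// Assumptions (hypothetical, for illustration): exactly one culprit exists,
// and crashes_with(enabled) is deterministic for a given set of toggles.
std::size_t find_culprit(
    std::size_t feature_count,
    const std::function<bool(const std::vector<bool>&)>& crashes_with) {
    assert(feature_count > 0);
    std::size_t lo = 0;
    std::size_t hi = feature_count;  // the culprit is somewhere in [lo, hi)
    while (hi - lo > 1) {
        const std::size_t mid = lo + (hi - lo) / 2;
        // Enable only the lower half of the remaining candidates.
        std::vector<bool> enabled(feature_count, false);
        for (std::size_t i = lo; i < mid; ++i) enabled[i] = true;
        if (crashes_with(enabled)) {
            hi = mid;  // crash reproduced: culprit is in the lower half
        } else {
            lo = mid;  // no crash: culprit must be in the upper half
        }
    }
    return lo;
}
```

With 8 toggles this needs 3 test runs instead of up to 8. It falls apart as soon as the crash is intermittent, because a single clean run no longer proves anything.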
What’s being done about device losses then?
Over the years we’ve fixed a bunch of device losses. 12.06 was probably the biggest release here, cutting device losses by about 75%. 12.1.0 also reduced device losses by a large fraction by working around a bug in the Nvidia driver. But of course new code gets written and new code can always be buggy. Way back in 11.50 we also added support for Aftermath, which is a library from Nvidia that helps gather crash dump data off of the GPU. In 12.2, I have massively reworked Aftermath support and also added support for AMD GPUs. While AMD does not actually use Nvidia’s Aftermath library, for simplicity’s sake the same command line option is re-used and my hope is to also add support for this for Intel GPUs in the future. Edit: It turns out Intel supports the AMD extension used for this, so Intel GPUs are supported as well.
With 12.2 and Aftermath enabled, X-Plane now injects per draw/dispatch checkpoints into the command stream. In the event of a device loss, these can then be analyzed to recover the GPU’s program state. Because it is very fine grained, this data is incredibly valuable and has actually helped fix two device losses already. The downside is that, because it is so fine grained, it also comes with some overhead. After all, for every draw or dispatch command, X-Plane has to stash away some data. But the goal was to keep this as lightweight as possible: the current implementation stores just a couple of bytes of data and defers resolving all of it until after a device loss. Under the hood, this is implemented with Aftermath on Nvidia and with buffer markers on AMD, plus logic inside of X-Plane for the post mortem resolving of data.
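The general idea behind such checkpoints can be sketched without any GPU at all. The code below is a simplified simulation, not X-Plane’s implementation: `Command`, `MarkerBuffer`, `execute` and `resolve` are hypothetical names, and the "GPU" runs commands strictly one after another, whereas real hardware overlaps many of them. On actual hardware the marker writes would presumably go through something like the VK_AMD_buffer_marker extension (`vkCmdWriteBufferMarkerAMD`) or Nvidia’s Aftermath SDK, neither of which is used here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// One draw or dispatch. The label only lives on the CPU side; the flag is a
// test hook that makes the simulated GPU crash on this command.
struct Command {
    std::string label;
    bool crashes;
};

// The only data written during execution: two small ids, standing in for
// a marker buffer that survives the crash.
struct MarkerBuffer {
    std::uint32_t begun = 0;     // id of the last command that started
    std::uint32_t finished = 0;  // id of the last command that completed
};

// Simulated GPU execution. Returns false if the device was "lost".
bool execute(const std::vector<Command>& commands, MarkerBuffer& markers) {
    for (std::size_t i = 0; i < commands.size(); ++i) {
        const std::uint32_t id = static_cast<std::uint32_t>(i + 1);
        markers.begun = id;                     // checkpoint before the command
        if (commands[i].crashes) return false;  // GPU dies mid-command
        markers.finished = id;                  // checkpoint after the command
    }
    return true;
}

// Post mortem: turn the raw ids back into something human readable.
// Nothing is resolved until after the crash, which keeps execution cheap.
std::optional<std::string> resolve(const std::vector<Command>& commands,
                                   const MarkerBuffer& markers) {
    if (markers.begun == markers.finished) return std::nullopt;  // nothing in flight
    assert(markers.begun >= 1 && markers.begun <= commands.size());
    return commands[markers.begun - 1].label;
}
```

If the second of three commands crashes, the buffer is left with `begun == 2` and `finished == 1`, and `resolve` maps that back to the second command’s label. With overlapping execution the answer becomes a range of candidates between the two markers rather than a single command, which is still far better than nothing.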
And now it’s your turn. Got an annoying device loss? Run with Aftermath, submit your crash reports and hopefully in an update coming to your install soon, that device loss is gone. And maybe by this time next year, I can sip Martinis on a tropical island because my work is finally done.
Misconceptions
One thing I want to make clear though: device losses and running out of VRAM are two entirely different issues. There is a persistent rumour that low VRAM can cause device losses, but this is not the case. The other thing I also frequently see is the advice to uninstall scenery or plugins. Scenery and/or plugins don’t get access to X-Plane’s Vulkan command stream so they can only very indirectly cause device losses. For example, it’s possible that art controls modified by plugins can cause device losses by enabling shader paths that are not normally taken and thus aren’t fully tested. But in general, it feels like there are a lot of snake oil fixes out there, so please be wary. Neither the log nor the alert box has enough information to triage a device loss to even remotely claim that X is the cause of it. That being said, it never hurts to do some simple A/B testing with things disabled, although this might just mask the problem instead of actually resolving it. A/B testing is particularly useful in the case of a repeatable device loss because you can get a much stronger signal from that test, so you can pass that information along with your bug report and make reproducing the issue much easier for us. For random device losses however it’s near impossible to get a clear signal from such A/B tests.
Excellent write-up, Sidney!
Why do you think there are some folks that NEVER have a VDLE (my last one was several years ago), while others get them regularly?
We all use pretty much the same hardware (3 different manufacturers of GPUs) and run the same GPU drivers and same software (X-Plane) on them.
It seems that some folks get these crashes regularly, others never. Where is the difference?
Despite similarities, setups can be quite different. Each generation of GPUs tends to also have a slightly different architecture, so there is often quite a difference in how things are executed despite the observable side effects staying the same. Plus things like overclocking can skew things as well. Not to mention how much load you put your system under and what kind of add-ons are used etc. My belief is that a lot of device losses are down to timing issues where it matters in what order things happen and if there is enough time for caches to get flushed/invalidated to observe what happened. So just holding X-Plane slightly differently can have a huge impact by skewing timing results. One other thing I didn’t really mention is that the GPU can also time out. Command buffers get 2 seconds to execute, otherwise the operating system will reset the GPU because a locked up GPU won’t put new images on screen and no one wants to hard reboot their system just to interact with it. The older folk here probably recognize a lot of these issues as something that used to happen semi frequently back in the day and the industry has been trying really hard to rid the world of the sins of this past. 2 seconds is a really long time in a world where, for a long time, the gold standard for frame time was 16ms and since then has only gone down, but because shaders are just arbitrary code executing, it’s possible that when held just right, loops never terminate or something else locks up the GPU.
Excellent, thoroughly but clearly detailed description of the situation and the response required from users. I think you can award yourself those Martinis anyway for all the effort you have put into XP so far and for the superb results achieved.
Hi Sidney, thank you for such an interesting and explanatory post. Fortunately I have had very few problems of this type with my Nvidia 4070ti super hardware.
Excuse me for going off topic, how is the implementation of Motion Vectors progressing, can we have something in XP12 in the near future?
Thanks for the explanation.
Excellent piece. Hope it gets read by many.
Proper spelling in two consecutive blog posts: first “labour” from Dellanie, and now “favourite” from Sidney. Well done and keep it up!
Excellent post I must add. How does the graphics driver come into play? Would you say an updated driver is essential, good or not important? I hate device loss crashes as much as I hate BSOD. At times, as you said, I too dread it so much I am reluctant to start a flight with all that it incorporates. I think this is very difficult to catch and report, simply as it is extremely random. Using Aftermath then would be impossible unless you can use it every time without a performance hit. I have noticed though, that the majority of device losses for me happen in outside view looking at a strange cloud formation. But also on ground taxiing, especially on cold days. ENGM is notorious for producing these crashes, but still random. I’d say 95% of all crashes happen in the air. So much so I now dread going into outside view in cruise to e.g. take a screenshot. 3 out of 4 times my device loss CTDs happen in outside view, rarely inside the cockpit. Anyway, hope you find more losses to patch, keep up the good work and pur on.