Category: Really Really Really Really Boring Stuff

What’s up with device losses in X-Plane anyways?

One of the best-known errors you will encounter in X-Plane is the infamous device loss error. But it’s also one of the least well understood errors and one of the hardest to debug, so in this blog post I want to talk about what device losses are, what we have done to fix them and what you can do to help us investigate these issues. Let’s actually start with the last of these, because if there is one takeaway from this blog post for anyone, it’s the following:

In case of device loss

If you encounter a reliable device loss error, please run X-Plane with --aftermath in the command line options, or run X-Plane via the X-Plane_aftermath.bat script from the Support folder inside of your root X-Plane folder. This flag currently works with Nvidia, AMD and Intel hardware, but it will come with a bit of a performance hit. It’s best used in cases where you already know you are going to run into a device loss and want to provide additional diagnostics. I would not recommend running with this flag all the time, in particular because it does not work around device losses in any way – it just gathers additional data to make device loss debugging easier. When running with Aftermath, the next time you encounter a device loss, X-Plane will no longer call it that but instead say “Encountered a GPU crash!”. Please submit those crash reports! If you see this message it means that X-Plane was able to gather diagnostic data that can help identify what happened.

What is a device loss anyways?

Everyone knows that applications can crash. The human species is incapable of writing bug free code and some of these bugs lead to crashes, while others just result in broken logic ranging from mildly entertaining to downright annoyingly broken. But crashes are definitely the worst kind of bug: If you have invested 7 hours into a flight and then suddenly X-Plane crashes, it’s probably one of the worst experiences in flight simming. I know it has put me off from virtual flying for weeks in the past. Luckily, when applications crash, it’s possible to capture the current state of execution and everything that led up to it. As programmers, we get to see the callstack, which tells us not only which function we are currently in and where, but also which function called it, which function called that one in turn, and so on. We can also see what thread executed it, what the other threads were up to and even the state of the CPU’s internal registers that hold part of the current program state. While this isn’t always enough to create a fix, it points in a very specific direction and lets us investigate the code and, together with the log, piece together what happened. Essentially, when you submit a crash report, you send us the black box of your flight. It contains forensic information that can be analyzed to figure out what happened, but we don’t look as cool as NTSB investigators.

Device losses in reality are just crashes as well, but crashes on the GPU. GPUs these days, and really for a long time now, are almost fully programmable and just execute arbitrary code. Programs that execute on a GPU are called shaders, a leftover from the olden days when they could exclusively be used to shade vertices. But these days, shaders are just tiny programs that execute on potentially tens of thousands of GPU cores. Shaders are responsible for basically everything that X-Plane puts up on the screen, from mundane tasks like transforming vertex data into screen space and calculating the colour of every single pixel all the way to culling tens of thousands of trees per frame. X-Plane makes use of a lot of shaders because GPUs are incredibly flexible and a lot of workloads lend themselves really well to their highly parallel nature. Counting shaders is surprisingly hard: X-Plane bundles shaders into what we call “modules”, which contain variants of similar shaders. But doing a count of just modules, X-Plane 11 shipped with 29, while X-Plane 12.1.3 had 88 modules and 12.2b4 ships with 91. Each shader module contains anywhere from a handful to thousands of distinct shader variants, so the real runtime count of compiled shader programs is much higher, somewhere in the mid to high 5 digit range. Quantity is of course not a measure of quality, but X-Plane 12 as well as any modern game wouldn’t be possible without the ability to flexibly run millions of shader invocations every single frame.

Just fix the crashes then

While shaders are no doubt one of the greatest additions to computer graphics, they come at a cost: They are incredibly hard, near impossible, to debug properly. On the CPU side you can attach debuggers to your code and then step through the code as it executes while seeing the result of each operation. On the GPU that is not a possibility: you can’t inspect the state of a running shader, only the results of the execution after the fact. And if nothing shows up on screen, then good luck trying to figure out where your triangles went. And if it crashes, you are even more toast because now you only have broken pieces left to look at. There are tools to make this a bit easier and it definitely isn’t as bad as it used to be, which is a big reason why it’s possible to put a lot of shaders into the field now: The tooling around this has gotten a lot better. But for crashes, the GPU once again turns into a terrible black box.

The first hurdle is that the CPU and GPU run asynchronously from each other. The CPU encodes operations for the GPU into a command buffer that is then submitted to the GPU for execution, so right there you already have a source of latency. In practice, the CPU is usually at least a frame ahead of the GPU in terms of what it is computing. Detecting a crash also involves latency: often the OS/driver has to recover the GPU from its current state and put everything back together. Since the GPU is responsible for getting anything onto the screen, the OS and driver are much more interested in getting you back into a state where you can interact with your computer. But eventually, X-Plane will catch up to the fact that the GPU crashed, because at some point a call into Vulkan will produce the infamous error code VK_ERROR_DEVICE_LOST. This is why it will often feel like your computer is glitching out prior to a device loss: everything stops working for a second, the OS and driver recover the GPU, your windows and displays might flicker around for a second and only then will X-Plane go “yeah by the way, stuff’s broken, yo”.

This huge latency between cause and effect is one of the big issues with device loss debugging. By the time the CPU side of X-Plane realizes the GPU is dead, it’s too late to gather data anymore: Something we did god knows how long ago crashed and that’s all the information we have. This is also why the log file tends to just not be very useful for device loss triage. This is especially bad when the device loss is hard to reproduce and happens only once in a blue moon. My favourite kind of device loss is the one where you can get X-Plane to crash really fast and reliably; in those cases it’s possible to start toggling things on and off and compare what happens to zero in on the actual cause. I have spent many a day having X-Plane take down my window manager over and over again.

What’s being done about device losses then?

Over the years we’ve fixed a bunch of device losses. 12.06 was probably the biggest release here, cutting device losses by about 75%. 12.1.0 also reduced device losses by a large fraction by working around a bug in the Nvidia driver. But of course new code gets written and new code can always be buggy. Way back in 11.50 we also added support for Aftermath, which is a library from Nvidia that helps gather crash dump data off of the GPU. In 12.2, I have massively reworked Aftermath support and also added support for AMD GPUs. While AMD does not actually use Nvidia’s Aftermath library, for simplicity’s sake the same command line option is re-used, and my hope is to also add support for this for Intel GPUs in the future. Edit: It turns out Intel supports the AMD extension used for this, so Intel GPUs are supported as well.

With 12.2 and Aftermath enabled, X-Plane now injects per draw/dispatch checkpoints into the command stream. In the event of a device loss, these can then be analyzed to recover the GPU’s program state. Because it is very fine grained, this data is incredibly valuable and has actually helped fix two device losses already. The downside is that, because it is so fine grained, it also comes with some overhead. After all, for every draw or dispatch command, X-Plane has to stash away some data. But the goal was to keep this as lightweight as possible: the current implementation stores just a couple of bytes of data and defers resolving all of it until after a device loss. Under the hood, this is implemented with Aftermath on Nvidia and with buffer markers on AMD, plus logic inside of X-Plane for the post mortem resolving of data.
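To make the checkpoint idea a bit more concrete, here is a toy model in Python. It is not X-Plane's code and does not touch Vulkan or Aftermath; it only sketches the principle under a few assumptions: every draw/dispatch gets a small integer ID, the "GPU" writes that ID into a marker slot before it starts a command and into a second slot once the command finishes, and the human-readable labels are only looked up after the crash. All names here (`CommandBuffer`, `FakeGPU`, `post_mortem`, `DeviceLost`) are made up for illustration.

```python
from dataclasses import dataclass, field


class DeviceLost(Exception):
    """Stands in for VK_ERROR_DEVICE_LOST in this toy model."""


@dataclass
class CommandBuffer:
    # Each entry is (checkpoint_id, work). The labels stay on the CPU side,
    # so recording a command only costs one small integer per draw/dispatch.
    commands: list = field(default_factory=list)
    labels: dict = field(default_factory=dict)

    def record(self, label, work):
        checkpoint_id = len(self.commands) + 1  # 0 means "nothing started yet"
        self.labels[checkpoint_id] = label
        self.commands.append((checkpoint_id, work))


@dataclass
class FakeGPU:
    # Two marker slots that are still readable after the crash:
    # the last command that started and the last one that finished.
    started: int = 0
    finished: int = 0

    def execute(self, cmd_buf):
        for checkpoint_id, work in cmd_buf.commands:
            self.started = checkpoint_id
            try:
                work()
            except Exception as exc:
                raise DeviceLost() from exc
            self.finished = checkpoint_id


def post_mortem(gpu, cmd_buf):
    """Resolve the raw marker values into a label, only after a device loss."""
    if gpu.started == gpu.finished:
        return None  # no command was in flight
    return cmd_buf.labels[gpu.started]


def bad_shader():
    raise RuntimeError("simulated GPU fault")


cmd_buf = CommandBuffer()
cmd_buf.record("draw terrain", lambda: None)
cmd_buf.record("dispatch tree culling", bad_shader)
cmd_buf.record("draw clouds", lambda: None)

gpu = FakeGPU()
try:
    gpu.execute(cmd_buf)
    culprit = None
except DeviceLost:
    culprit = post_mortem(gpu, cmd_buf)

print(culprit)
```

The toy GPU runs one command at a time, so the markers point at exactly one culprit. Real hardware keeps many commands in flight, so in practice the markers are more likely to narrow things down to a range of suspects than to a single command.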

And now it’s your turn. Got an annoying device loss? Run with Aftermath, submit your crash reports and hopefully in an update coming to your install soon, that device loss is gone. And maybe by this time next year, I can sip Martinis on a tropical island because my work is finally done.

Misconceptions

One thing I want to make clear though: device losses and running out of VRAM are two entirely different issues. There is a persistent rumour that low VRAM can cause device losses, but this is not the case. The other thing I also frequently see is the advice to uninstall scenery or plugins. Scenery and/or plugins don’t get access to X-Plane’s Vulkan command stream so they can only very indirectly cause device losses. For example, it’s possible that art controls modified by plugins can cause device losses by enabling shader paths that are not normally taken and thus aren’t fully tested. But in general, it feels like there are a lot of snake oil fixes out there, so please be wary. Neither the log nor the alert box has enough information to triage a device loss to even remotely claim that X is the cause of it. That being said, it never hurts to do some simple A/B testing with things disabled, although this might just mask the problem instead of actually resolving it. A/B testing is particularly useful in the case of a repeatable device loss because you can get a much stronger signal from that test, so you can pass that information along with your bug report and make reproducing the issue much easier for us. For random device losses, however, it’s near impossible to get a clear signal from such A/B tests.

Posted in Development, Really Really Really Really Boring Stuff by | 8 Comments

Have You Heard the Good News About Elixir?

[This post is a “behind the scenes” look at the tech that makes up the X-Plane massive multiplayer (MMO) server. It’s only going to be of interest to programming nerds—there are no takeaways here for plugin devs or sim pilots.]

[Update: If you’re interested in hearing more, I was on the ThinkingElixir podcast talking about this stuff.]

In mid-2020, we launched massive multiplayer on X-Plane Mobile. This broke a lot of new ground for us as an organization. We’ve had peer-to-peer multiplayer in the sim for a long time, but never server-hosted multiplayer. That meant there were a lot of technical decisions to make, with no constraints imposed by existing code.

Posted in Development, Really Really Really Really Boring Stuff by | 21 Comments

XPLMInstance: Two Tricks

This post is just targeted at plugin developers who are modernizing their object drawing – if you don’t write plugin code, the Cincinnati Zoo has been showing their animals on YouTube – it’ll be a lot more entertaining than this post. (An XPLMInstance cannot tunnel down two feet in fifteen seconds – one point for the zoo animals.)

XPLMInstance makes a persistent object that lives inside X-Plane that is visible in the 3-d world. It changes how you draw from “run some drawing code every frame” to “tell X-Plane that there is a thing and update its data every now and then.”

Instancing is actually a lot easier than draw callbacks! But there are two tricky gotchas:

1. You must create the custom DataRefs for your OBJ’s animation before you load the object itself with the SDK. (If the DataRefs do not exist at load time, the animations are disabled as “unresolved to any DataRef”.)

2. When you create the instance, make sure your custom DataRefs are on the list of DataRefs for that instance.

Here’s the really baffling thing: if you create the custom DataRef and then add it to the instance’s list, your DataRef callbacks will not be called.

Wha?

Here’s the trick: the DataRef you register is a global identifier, allowing the object to refer to what it wants to listen to. That’s why you have to create the DataRef – so that the identifier exists.

But when you create an instance, each instance has memory that holds a different copy of those DataRefs.

For example, let’s say you have a truck with four DataRefs, and you make five instances. X-Plane allocates 20 slots (four DataRefs times five instances) to store five copies of each DataRef’s values.

The instances never look at the DataRef itself. They only look at their local copies. That’s why when you push different data to the instance with XPLMInstanceSetPosition, each instance animates with its own values – each instance looks at its own local data.

This is also why you won’t see your DataRef callbacks called (unless you use DataRefEditor or some other tool). The object rendering engine isn’t looking at the DataRefs themselves, it’s looking at the local copies.

In other words, XPLMInstance turns DataRefs from the pull model you are used to (X-Plane pulls on your read function to get the value) to a push model (you push values with XPLMInstanceSetPosition into the instance’s memory).
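If the pull-versus-push distinction is still fuzzy, the following Python model may help. It is only a sketch of the bookkeeping described above, not the SDK and not how X-Plane is actually implemented: `DataRefRegistry`, `Instance`, `set_position` and the `truck/...` DataRef names are all invented for the example.

```python
class DataRefRegistry:
    """Global identifiers only; stands in for the DataRefs a plugin registers."""

    def __init__(self):
        self.read_calls = 0
        self._names = set()

    def register(self, name):
        self._names.add(name)

    def exists(self, name):
        return name in self._names

    def read(self, name):
        # Pull model: only an outside tool would ever end up here.
        self.read_calls += 1
        return 0.0


class Instance:
    """One placed object with its own local copy of every listed DataRef."""

    def __init__(self, registry, dataref_names):
        missing = [n for n in dataref_names if not registry.exists(n)]
        if missing:
            raise KeyError(f"unresolved DataRefs: {missing}")
        self.names = list(dataref_names)
        self.slots = [0.0] * len(self.names)

    def set_position(self, values):
        # Push model: analogous to the per-instance data array that the
        # plugin hands over when it updates an instance.
        if len(values) != len(self.slots):
            raise ValueError("one value per listed DataRef is required")
        self.slots = list(values)

    def animate(self):
        # Rendering reads the local slots and never calls registry.read().
        return dict(zip(self.names, self.slots))


registry = DataRefRegistry()
truck_datarefs = ["truck/wheel_angle", "truck/steering", "truck/door", "truck/beacon"]
for name in truck_datarefs:
    registry.register(name)

trucks = [Instance(registry, truck_datarefs) for _ in range(5)]
total_slots = sum(len(t.slots) for t in trucks)

trucks[0].set_position([90.0, 0.0, 0.0, 1.0])
trucks[1].set_position([45.0, -10.0, 1.0, 0.0])

print(total_slots)
print(trucks[0].animate()["truck/wheel_angle"], trucks[1].animate()["truck/wheel_angle"])
print(registry.read_calls)
```

Four DataRefs times five instances gives the 20 slots from the truck example, each truck animates from its own values, and the registry's read function is never touched. Creating an `Instance` with a name that was never registered fails, which mirrors the "unresolved to any DataRef" situation from the first gotcha.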

This implies two things about your add-on:

  • It doesn’t really matter what your DataRef read functions do – they can just return zero, and
  • You can’t use tools like DataRefEditor or DataRefTool to debug your animations. (That didn’t work well in legacy code either, but it really won’t work now.)

If you try the obvious optimization of not creating your custom DataRefs (“hey, no one calls them”) before you create your instance, you will find that animation just stops working. This is because we need the DataRef to be that global identifier to match your instance data with the animations of the object itself.

One last note: if your old code used sim/graphics/animation/draw_object_x/y/z to determine which object was being animated (from inside a plugin “get” function) you do not need to do this anymore. Because each instance has its own local copies and your DataRef function isn’t called, this technique is obsolete.

In summary:

  • You must register custom DataRefs.
  • Their callbacks can just return 0 – X-Plane’s object rendering never calls them.
  • Always list your custom DataRefs for animation when you create an instance.
  • Do not use draw_object_x/y/z; use XPLMInstanceSetPosition to drive per-instance animation.
Posted in Development, Plugins, Really Really Really Really Boring Stuff by | 22 Comments

Linux users: Please don’t run X-Plane with sudo!

TL;DR: Running X-Plane with sudo is a bad idea. Instead, create proper udev rules (per this and this).

During the 11.10 beta, I’ve gotten a lot of bug reports from Linux users who report that their keyboard is being recognized as a joystick. This is… sort of a bug, but mostly intentional.

(If you’re not a Linux user, this won’t apply to you… but it will bore you! 😉 )

Background: What changed?

On Linux, prior to X-Plane 11.10, we were very picky about what USB devices we considered to be a joystick: we required a device to present a so-called “absolute” axis (in contrast to a “relative” axis like a mouse uses). The downside of this is that it prevented home cockpit builders from creating button-only hardware.

So, in 11.10 and beyond, we relaxed the requirements: if a USB device presents us with either an axis, button, or hat switch, we’ll treat it like a joystick.

The problem with this policy seems obvious: keyboards have “buttons”! Like, 104+ of them!

The reason we didn’t worry about this is that the keyboard is only accessible (as a USB device) to programs running as root. So long as X-Plane runs as a normal user, it doesn’t even have the option of treating the keyboard as a joystick.
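The before/after policy can be written down as two small predicates. This is a Python sketch of the rules as described in this post, not X-Plane's actual device detection code; the `UsbDevice` type, its fields and the `accessible` flag (standing in for "the OS lets this user open the device") are assumptions made for the example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class UsbDevice:
    name: str
    absolute_axes: int = 0
    buttons: int = 0
    hats: int = 0
    accessible: bool = True  # False when the OS hides the device from this user


def is_joystick_legacy(dev):
    """Pre-11.10 rule: the device must present at least one absolute axis."""
    return dev.accessible and dev.absolute_axes > 0


def is_joystick_relaxed(dev):
    """11.10+ rule: an axis, a button, or a hat switch is enough."""
    return dev.accessible and (dev.absolute_axes > 0 or dev.buttons > 0 or dev.hats > 0)


button_box = UsbDevice("home cockpit button box", buttons=32)
keyboard_as_user = UsbDevice("keyboard", buttons=104, accessible=False)
keyboard_as_root = replace(keyboard_as_user, accessible=True)

for dev in (button_box, keyboard_as_user, keyboard_as_root):
    print(dev.name, dev.accessible, is_joystick_legacy(dev), is_joystick_relaxed(dev))
```

The button box is rejected by the old rule and accepted by the new one, which is the whole point of the change. The keyboard only slips through the relaxed rule once it becomes accessible, which is what happens when the sim runs as root.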

Why do people run as root?

The impetus for running as root (via sudo) is simple: if your Linux distro doesn’t recognize your joystick hardware as something that should be available to normal applications, running as root is a brute-force way to let X-Plane use your joystick.

Let me say emphatically: This is a bad idea.

Especially with early, buggy betas, running as root makes it possible for X-Plane to do way more damage to your system than would ever be possible as a normal user. Consider the unlikely—but possible!—scenario where somebody made a typo in the code which inadvertently tries to delete a system folder. There are two possible outcomes here:

  • If you’re running as a normal user: Nothing happens. The operating system refuses to let X-Plane hurt your system.
  • If you’re running as root: The operating system silently obeys. You curse X-Plane for breaking your system.

Running X-Plane as root is like giving a blank check to every cashier you buy something from—it’s way more power than they need to do their job, and it’s liable to burn you at some point!

The Right Way™ to let X-Plane use your joystick

As described in the latter half of this old dev blog post, you don’t have to run with sudo. Instead, you can create udev rules to tell your operating system to let normal applications use your joystick. The GUI tool linked at the end of that post makes it even easier.

(Some users found the instructions there confusing; this post on the Org might help.)

Remember that after you create your rules, you can even submit them to your distro to make life easier for other flight simmers!

There’s one hitch: after running as root, your file permissions (especially your prefs) may have gotten screwed up. This can be fixed from the terminal by making your normal user account the owner of your X-Plane directory, like this:

$ sudo chown -R <username>:<username> /path/to/X-Plane/

(So, in my case, my username is tyler, and X-Plane is installed to ~/Documents/X-Plane/, so I’d run $ sudo chown -R tyler:tyler ~/Documents/X-Plane/.)

Now, to those of you who have been running as root… “go, and sin no more”! 😉

Posted in Really Really Really Really Boring Stuff by | 14 Comments

X-Plane 11.05, 11.10, and My Mostly Dead Hard Drive

TL;DR version: my iMac’s fusion drive “lost its marbles” right before I went on vacation. This has delayed cutting an 11.05 release candidate 2 with a few scenery fixes, but we should get to it next week. In parallel, we’re working furiously to get all of the code locked down for 11.10.

Everything else that follows is really, really, really, really boring. I’m writing it only because some of my co-workers watched this slow motion car crash and tightened up their backup game a bit. If my drive fail can shake you out of complacency, read on.

Basically: my iMac is my main development machine, and the data is backed up and/or duplicated in a bunch of different places: a USB Time Machine archive, a Backblaze cloud backup (both are “full machine”), Dropbox for virtually all of my documents, and my work for Laminar is kept on Laminar’s source control servers. Data loss was never a huge risk here.

Time loss, however, is a real risk! My goal was to lose as little work time to fixing my machines as possible. So my plan was: restore from the Time Machine disk backup, request a cloud backup restore via hard drive, return the hard drive. The total cost would be a few hours of disk copying and less than an hour of my time. My development machine would be usable for new work while waiting for the cloud backup to arrive.

This has not gone as well as I had hoped! You can learn from my fail here — a few notes.

  1. Your backup might as well not be a backup if you have not checked that the backup contains the data you think it contains. It turns out that both the cloud backup and Time Machine backup were missing files! I’m very lucky that they weren’t missing the same files.
  2. Time Machine sometimes decides not to back stuff up. OS X has a hidden per-file/directory attribute that can exclude a file from backup without showing it in the Time Machine UI! Once you check your Time Machine backup and find a folder is missing, from Terminal you can do tmutil isexcluded <file path> to see if the file has been explicitly excluded. If it is, tmutil removeexclusion <file path> fixes this.
  3. Backblaze ships with a bunch of file exclusions too – mostly designed to not archive stuff that isn’t your data. But beware – stuff you care about might be on the exclusion list. (For example, virtual disks in a virtual machine are excluded by default.) I had to add back .iso files to the backup list. Backblaze backups are also not bootable. This is something I can live with, but always read the fine print about what’s in the backup.
  4. The Backblaze data restore has been very slow – over ten days for less than half a terabyte and it’s still “in progress”.* While they haven’t exceeded the maximum restore time they advertise, it’s slow enough that the delay matters.
  5. One other note on Backblaze: I saw major performance problems on my iMac while Backblaze was running, even when a backup was not running (since they were scheduled for overnight). I do not think this is necessarily Backblaze’s fault – it may be a problem with CoreStorage (which “runs” the fusion drive) or even a fault with my drives. From what I can tell, cloud backup exacerbated it by putting a lot more file traffic on my system.
  6. A possible danger if (like me) you keep documents on Dropbox to have them everywhere: when I restored my iMac from Time Machine, I was exposing Dropbox to my data from a week ago. I didn’t wait to see if Dropbox would figure out what happened; I unlinked my iMac while it was offline after the restore, then re-established Dropbox and let it download my data. Better safe than sorry.
  7. I have been backing up to portable 2.5″ USB drives because they’re cheap and really convenient, but they have a down-side: the mechanisms can easily fail and take your whole backup down. I have five of these drives and one has failed in a three year period.
  8. I’m really unhappy with CoreStorage, to the point where I would not recommend a fusion drive anymore. CoreStorage is an Apple virtual-volume technology (similar to soft-RAID) that makes one small SSD and one large HDD look like a single unified volume, with some of the data “cached” on the SSD for performance. CoreStorage is a lot newer than HFS, so when things go wrong, most disk utilities you would go to just don’t work.
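For the first note in the list above (check that the backup really contains what you think it contains), a small script can do the boring part. The Python sketch below is one hypothetical way to do it, assuming the backup is mounted somewhere as an ordinary directory tree that mirrors the source; the function name and the demo paths are made up. It only checks that each file exists in the backup, not that the contents match.

```python
import os
import tempfile
from pathlib import Path


def missing_from_backup(source_root, backup_root):
    """Return relative paths that exist under source_root but not under backup_root."""
    source_root, backup_root = Path(source_root), Path(backup_root)
    missing = []
    for dirpath, _dirnames, filenames in os.walk(source_root):
        for filename in filenames:
            rel = (Path(dirpath) / filename).relative_to(source_root)
            if not (backup_root / rel).is_file():
                missing.append(rel.as_posix())
    return sorted(missing)


# Small self-contained demo: a "source" tree with two files and a "backup"
# that silently skipped the disk image.
with tempfile.TemporaryDirectory() as tmp:
    src, bak = Path(tmp) / "source", Path(tmp) / "backup"
    for rel in ("docs/notes.txt", "vm/disk.iso"):
        (src / rel).parent.mkdir(parents=True, exist_ok=True)
        (src / rel).write_text("data")
    (bak / "docs").mkdir(parents=True)
    (bak / "docs" / "notes.txt").write_text("data")
    report = missing_from_backup(src, bak)

print(report)
```

Running something like this against each backup separately would have shown that the two backups were missing different files, which is much nicer to learn before the drive dies than after.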

I actually ended up in a state where (after wasting almost an entire day) I could see my data, but only in single-user mode with a read-only file system. I might have been able to directly copy the data, but I chose to format the drive and restore from the backup to save more of my time and get back to coding X-Plane. My suggestion for developers getting iMacs: get an internal SSD (whatever storage size you can afford) and supplement with a fast external hard drive over Thunderbolt.

Going forward, I am replacing the portable backup drives with a Synology NAS RAID device – this gets me high performance, high capacity backup (about 10 TB) with redundant drives. I picked HGST drives because they’ve had a good track record for reliability. With a large network attached storage server, I can have all of my machines backing up in the house all of the time, and have that be the primary way of getting my data back. I’m keeping cloud backup as a last-resort-the-house-burned-down kind of thing.

If my cloud backup hasn’t shipped by Monday, I will rebuild the setup I use to cut builds by hand (it’ll take a few hours but it’s doable) and we’ll cut 11.05r2 that way. If the drive comes, I can get the last of my data back and we’ll get to 11.05r2 the easy way. Either way, we’ll get things moving again.

 

* I opted for a hard drive restore, which should have one day of shipping time, instead of a download; a smaller restore based on download made clear that the transfer speeds would be slower than FedEx for that quantity of data.

Posted in Development, Really Really Really Really Boring Stuff by | 31 Comments