The Anatomy of a Plugin Crash

This morning I found out why certain plugins (e.g. SeaTraffic) are crashing X-Plane 10.50 beta 1 on startup. Beta 2 was being uploaded when I found this, but I'm going to fix this bug first and recut the beta, since it's a big bug (it leaves many users' installations dead on boot). Beta 2 should be up by tomorrow morning but might go live some time today if things go well.

The plugin-crash bug is a regression bug - X-Plane used to do something right and has now "regressed" and is doing it wrong in X-Plane 10.50 beta 1.  This is our bug to fix, and we will fix it in beta 2; no plugin will need any changes, and the plugins crashing in beta 1 will start working again in beta 2.

That's everything you need to know about the plugin crash bug. What follows is a very long and verbose write-up of the crash (think of it like an NTSB accident report). Perhaps it will provide plugin authors with some insight into where compatibility problems come from, and why early betas can be unstable.

A Proximate Cause: Loading an OBJ "Too Early"

We'll start from the proximate cause (i.e. literally why did the plugin crash) and work our way backward to eventually understand the "why" of this crash. The fix for the bug is relatively simple, but you can't know if a bug fix is correct without understanding the complete why; bugs like this take a lot more time to diagnose than to cure.

The crashing plugins are all calling XPLMLoadObject from their XPluginStart callback - that is, they are asking X-Plane to load an OBJ at the earliest possible time they can. This is totally legal! But it turns out that:

  1. X-Plane isn't ready to load an object at plugin start and
  2. The penalty for doing so is a lot worse (a crash always) in 10.50 than in 10.45 (a crash rarely).

Why isn't X-Plane ready to load objects? The answer lies in a little-known part of the OBJ file format called "conditionals."  Conditionals are a lot like C preprocessor macros - they are IF-THEN statements in an OBJ that let you make parts of your OBJ file only run in certain settings. For example, we use conditionals to remove the artist-drawn shadows from the static aircraft when X-Plane is in a mode that can draw shadows dynamically; this prevents "double-shadow".

Conditionals are weird - they work not by evaluating the condition when we draw, but rather when we load; if a conditional is false in the OBJ, that text in the file is literally not used.* This means we need conditionals to work at load time, which in our case means when XPluginStart is called.

Unfortunately, conditionals are set up when we load preferences, and we load preferences after we call XPluginStart. So when these plugins load an object, the conditional system isn't inited, and the OBJ loader crashes. Since we called the OBJ loader from a plugin, the plugin gets blamed. (A plugin author looking at the plugin crash can easily tell what happened - "I called XPLMLoadObject and it exploded!".**)
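To make the load-time behavior concrete, here is a minimal sketch in C of a loader that filters lines as it reads them. Everything in it is an assumption made for illustration: CondTable, cond_init and obj_filter are hypothetical names, the "file" is a string of placeholder lines rather than real OBJ syntax, and only the one condition mentioned in this post is modeled. It is not X-Plane's code. Where the real sim crashed, the sketch returns -1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the sim's conditional table (not real X-Plane code). */
typedef struct {
    bool inited;           /* set once "preferences" have been loaded */
    bool scenery_shadows;  /* value of the single condition modeled here */
} CondTable;

/* Models loading preferences, which is what sets up the conditionals. */
static void cond_init(CondTable *t, bool scenery_shadows) {
    assert(t != NULL);
    t->inited = true;
    t->scenery_shadows = scenery_shadows;
}

/* True if the line of length len starting at line equals word exactly. */
static bool line_is(const char *line, size_t len, const char *word) {
    return strlen(word) == len && strncmp(line, word, len) == 0;
}

/* Copies the lines of src that survive load-time evaluation into dst.
 * Returns the number of lines kept, -1 if a conditional was evaluated
 * before cond_init ran, or -2 if dst is too small. No nesting, no ELSE. */
static int obj_filter(const CondTable *t, const char *src,
                      char *dst, size_t dst_size) {
    assert(t != NULL && src != NULL && dst != NULL && dst_size > 0);
    bool keeping = true;  /* false while inside a false IF block */
    int kept = 0;
    size_t out = 0;
    dst[0] = '\0';

    while (*src != '\0') {
        size_t len = strcspn(src, "\n");  /* length of the current line */

        if (line_is(src, len, "IF SCENERY_SHADOWS")) {
            if (!t->inited) return -1;    /* table used before init */
            keeping = t->scenery_shadows;
        } else if (line_is(src, len, "ENDIF")) {
            keeping = true;
        } else if (keeping && len > 0) {
            if (out + len + 2 > dst_size) return -2;
            memcpy(dst + out, src, len);
            out += len;
            dst[out++] = '\n';
            dst[out] = '\0';
            kept++;
        }
        src += len;
        if (*src == '\n') src++;
    }
    return kept;
}
```

In this model a file with no IF line never touches the table, so it loads fine even before cond_init has run, while a file with an IF line fails. A false conditional leaves no trace in the output at all, which is the "text is literally not used" behavior.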

Is this typical?  Yes. Often it's the combination of special rules for initialization, plugins using code in a way the sim doesn't, and multiple subsystems that cause crashes.

Why Now?

When I found the code path that was causing the crash, my first reaction was "this is a mess...how did this work in X-Plane 10.45???".  So I went back to the X-Plane 10.45 code and found that this was broken in X-Plane 10.45 too! So what changed?

The answer is that in X-Plane 10.45, the conditional system would only crash (due to not being inited) if it was used; in other words, any object with IF SCENERY_SHADOWS (for example) would crash if loaded from a plugin at XPluginStart. But if you didn't use the conditionals, things worked fine. And my guess is that very few plugin authors use conditionals in their objects.

What changed is that X-Plane 10.50 uses the conditional system for every object load, even if you don't have an IF statement. So now the crash is 100% reproducible, not a rare "five things must all go wrong at once" kind of crash.

Is this typical? Yes. This is a bug where something was fundamentally broken for a long time, and an incremental internal change in how our code works changed the symptoms from 'rare' to 'always'. This is very typical of beta bugs.

There may be other bugs like this too - it looks to me like objects loaded from XPluginStart with named lights might not get their named lights.

Why Do You Need Conditionals If I Don't?

The next question this raises is: why did you monkeys change the OBJ loader code? What was wrong with what we had before? What kinds of X-Plane features cause this code change that break plugins?

The conditional code changed to fix a bug that shipped in X-Plane 10.45. The bug is: if you start the sim with HDR off and then turn HDR on, some spill lights don't appear. The cause is that the OBJs are loaded with spill lights stripped out for performance*** (since HDR is off). In X-Plane 10.45 when you turn on HDR, we reload a lot of objects - but not necessarily the ones we need to, and the spill lights are lost. Rebooting brings the lights back.

To fix this, I modified the OBJ engine to track which conditionals any given object uses. If you add "DEBUG" to the end of your object to view the diagnostics, you'll now see this in the output. When you change a rendering setting in X-Plane 10.50, only the objects that use a conditional that was affected by the settings change are reloaded.

That's a big improvement in loading rules compared to X-Plane 10.45. Turn on HDR and only objects with spills are reloaded; turn on shadows and only objects with optionally baked shadows are reloaded. But it means we need to use the conditional system on every object load to set up those flags; that's what exposed the bug.
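The tracking idea can be sketched as a bitmask per object. This is only an illustration under assumed names: Object, COND_HDR, COND_SHADOWS and reload_affected are hypothetical and do not come from X-Plane's source. Each object records which conditions its file tested when it was loaded, and a settings change reloads only the objects whose mask overlaps the changed conditions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical condition bits; the names are illustrative only. */
enum {
    COND_HDR     = 1 << 0,
    COND_SHADOWS = 1 << 1
};

typedef struct {
    const char *name;
    unsigned    uses;   /* conditions this object's file tested at load time */
    int         loads;  /* how many times the object has been (re)loaded */
} Object;

/* Reloads only the objects whose recorded conditions overlap changed.
 * Returns the number of objects reloaded. */
static size_t reload_affected(Object *objs, size_t count, unsigned changed) {
    assert(objs != NULL || count == 0);
    size_t reloaded = 0;
    for (size_t i = 0; i < count; i++) {
        if (objs[i].uses & changed) {
            objs[i].loads++;  /* stand-in for re-parsing the file */
            reloaded++;
        }
    }
    return reloaded;
}
```

For example, with one object that uses COND_HDR, one that uses COND_SHADOWS and one that uses neither, a call with COND_HDR reloads exactly one object. The cost of this design is that the uses mask has to be filled in on every load, including for objects with no conditionals, so every load goes through the conditional system.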

Is this typical? Yes! What we have is a simple refactor making different use of an internal API to fix a bug, and the results make a separate existing bug worse; that separate bug is reproducible only via plugins.

Is There an Ultimate Cause To All of This?

When fixing bugs like this I have to ask myself "is there a way this could have been avoided?"

The root of this bug is the root of a large number of plugin quirks and edge cases: XPluginStart (the first thing a plugin does) is called insanely early in X-Plane's load process; as a result, a lot of the SDK isn't actually available.

The decision to call XPluginStart early is a bad decision, and it is one that I made, well over a decade ago, to solve one specific problem. At the time, X-Plane (6!) had no option to save the selected AI aircraft to preferences. XSquawkBox needed specific AI aircraft loaded to support multiplayer, and it was really slow to let X-Plane load 20 random aircraft, then reload them all later.

To "solve" this (and I use the term loosely, since this one fix has been the source of so many bugs) I put XPluginStart before almost all parts of sim load, so that the XPLMAircraft API could let a plugin pick the AI aircraft before they were loaded, influencing the first load.

If I had a time machine, I'd go back to 2000 and kick myself in the ass. This early in load, virtually all rules of how X-Plane works are wrong since so much of the sim is not yet loaded. Plugin authors already cope with this by "deferring" their work until the sim has fully loaded; our advice for a while has been to not touch any aircraft before this point.
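The deferral pattern looks roughly like the sketch below. It is written against stand-ins so it compiles without the SDK: PluginState, plugin_start and plugin_tick are hypothetical names, and sim_ready is an assumed signal for "the sim has finished loading". In a real plugin, one common choice (an assumption here, not something this post prescribes) is to drive the periodic hook from a callback registered with XPLMRegisterFlightLoopCallback and to do the XPLMLoadObject call there.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical plugin state; none of these names come from the X-Plane SDK. */
typedef struct {
    const char *pending_path;  /* object requested at start-up, not yet loaded */
    bool        loaded;        /* true once the deferred load has run */
} PluginState;

/* Start-up hook: remember what we want, but touch nothing in the sim yet. */
static void plugin_start(PluginState *p, const char *obj_path) {
    assert(p != NULL && obj_path != NULL);
    p->pending_path = obj_path;
    p->loaded = false;
}

/* Periodic hook: performs the deferred load the first time the sim reports
 * that it is fully loaded. Returns true once there is nothing left to do,
 * so the caller can stop scheduling it. */
static bool plugin_tick(PluginState *p, bool sim_ready) {
    assert(p != NULL);
    if (p->loaded) return true;
    if (!sim_ready) return false;
    /* A real plugin would call its object-loading routine here. */
    p->loaded = true;
    p->pending_path = NULL;
    return true;
}
```

The start-up hook does bookkeeping only, so it never depends on which parts of the sim happen to be initialized that early.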

So is this typical? Yes - it's yet another edge case introduced by plugins starting unnecessarily early.

 

* If you are wondering why conditionals work at load time and not run time, the answer is instancing. X-Plane does the complex analysis to categorize an object as instancing-friendly when it is loaded; attributes disable instancing. The conditionals run at load time so that if attributes like ATTR_poly_os that are there for shadowing are removed by conditionals, then the object becomes instancing-friendly (because the conditional is like removing the text).

In other words, by having conditionals "pre-process" the OBJ file, you don't pay for what you don't use.

** If you call a plugin API from your plugin and the plugin code crashes, it could be X-Plane's fault or your fault; plugin APIs may crash if given bad arguments (e.g. a junk pointer for a string argument).

*** This is the same idea as above; by removing stuff you don't use like a spill light that isn't seen with HDR off, we can get X-Plane onto a faster path. For example, if an object contains a spill light with a dataref, it can't be instanced; if the spill light will never be drawn, deleting it makes the object instancing-eligible.



About Ben Supnik

Ben is a software engineer who works on X-Plane; he spends most of his days drinking coffee and swearing at the computer -- sometimes at the same time.

11 Responses to The Anatomy of a Plugin Crash

  1. Mario Donick says:

    That's VERY interesting, and also very well explained! Thanks a lot for the insight!

    (And it explains why sometimes I had no or way too few lights when switching from non-HDR to HDR...)

  2. Carlos says:

    Ben,

    Is it going to be necessary to download a new installer once b2 goes live?

    • Ben Supnik says:

      No. If you have the current installer (3.40r3) you don't need a new installer.

      If you are experiencing crashes on startup you may have to -manually- check for updates using that installer; some of the crashes in beta 1 crash before the auto-update check happens.

  3. Stevil says:

    Cool, Ben!
    Thanks for your hard work and especially for that detailed report of your findings. It was really interesting to read 🙂

  4. This is a bit of a meta-comment, and is probably only of interest to the nerdy types... so any reader should feel free to skip if it ain't of interest to you 😉

    In the project I am currently employed in, just until a couple of weeks ago, we had one automated test that was failing "sometimes". There was no clear cause for the test to fail, but myself and some other members of the team that took some time to investigate postulated that it was due to timing issues (always a nasty thing with asynchronous code execution). Eventually I was tasked with actually fixing the issue. So, my initial reaction was to refactor the code in such a manner that the timing would not be an issue. Once I did that, the test started to fail *always*.

    Many junior software developers would think that this was a turn for worse, but I was positively thrilled. Instead of a test that was randomly failing, I now had a test that was always failing - so ... this brought the realisation that the random failures were *not* the problem ... the problem was that most of the time the test was "randomly" passing - even though it should have failed. So ... I was thrilled because I had made the test behave deterministically and this was the first step in getting the feature actually work the way it was supposed to.

    What to take home from all of this? a) Automated tests are developer's friend, not an enemy. b) If/when your tests fail - even randomly, look into it; there is possibly a bug involved. c) If you make a change that causes a test that sometimes fails to fail always - that is not necessarily a bad thing.

    ...oh, and in case it is of interest to anyone, I did manage to fix the actual root cause of the failure in our tests. Since then the test (unmodified since the initial modification that caused it to always fail) has passed without problems. That is always nice thing to take home after work 🙂

    • Joseph N. says:

      That is great to hear! It sort of reminds me of the experimental process.

      Glad the issue was worked out, and this is a great connection to Ben's post.

  5. Glen Andker says:

    Ben,
    Is it possible you could apply this fix to the current 10.45 stable version?

    • Ben Supnik says:

      No.
      - This bug is new to 10.50, so applying the fix to 10.45 makes no sense.
      - We aren't going to patch 10.45 at all - 10.50 -is- where bug fixes will go.

  6. Stanislaw Halik says:

    Hey Ben,

    Is lazy-loading .obj conditionals' dependencies out of the question?

    cheers,
    sh

    • Ben Supnik says:

      I think so, yeah. If you want your object lazy-loaded, you should do so at the plugin level; since the next thing you can do with an OBJ is -draw- it, lazy loading by the SDK would imply blocks -during- drawing for loading objects. So we assume that if you want the OBJ now, you mean it. Note that you can also schedule an object to load async, which is often the best compromise.

  7. Saar says:

    Sounds like catch 22 to me.

    A workaround might be sensible in current iteration of XP life. Maybe on next XP version you will be able to fix this issue, since incompatibilities should be acceptable.
