[This post is a “behind the scenes” look at the tech that makes up the X-Plane massive multiplayer (MMO) server. It’s only going to be of interest to programming nerds—there are no takeaways here for plugin devs or sim pilots.]
[Update: If you’re interested in hearing more, I was on the ThinkingElixir podcast talking about this stuff.]
In mid-2020, we launched massive multiplayer on X-Plane Mobile. This broke a lot of new ground for us as an organization. We’ve had peer-to-peer multiplayer in the sim for a long time, but never server-hosted multiplayer. That meant there were a lot of technical decisions to make, with no constraints imposed by existing code.
We had a few goals from the start:
- The server had to be rock solid. We didn’t want a tiny error processing some client update to bring down the whole server for everyone connected.
- We wanted a single shared world[1]. Functionally, this means the language/framework we chose would need to have a really good concurrency story, because it would need to scale to tens of thousands of concurrent pilots.
- We wanted quick iteration times. We couldn’t be sure how well MMO would be received by users, so we wanted the initial investment in it to be just enough to validate the idea.
- It needed to be fast. Multiplayer has a “soft real time” constraint, so we needed to be able to service all clients consistently and on time. (Quantitatively, this means our 99th percentile response times matter a lot more than the mean or median.)
Choosing a Language
From those requirements, we could draw a few immediate conclusions:
- The requirements for stability and fast iteration time ruled out C++ (or, God help us, C). Despite having a lot of institutional knowledge about those languages, they’re slower to develop in than modern “web” languages, and a single null pointer will bring down the entire system. (Ask me how I know. 😉 )
- The speed & scalability requirements ruled out a lot of modern web languages like Ruby, where the model for scaling up is generally “just throw more servers at it.” We didn’t want to (forever!) pay the development cost of synchronizing multiple machines across a data center—that’s a drag on both dev time and client latency.
This eventually led me to three top contenders:

- Rust
- Go
- Elixir
Each of these languages has a solid concurrency story. Rust would probably be the fastest & most scalable, at the cost of developer productivity. But Elixir had one major thing that neither Rust nor Go could touch: fault tolerance built into the very core of the platform.
Elixir has this concept of running code in lightweight, separate “processes.” These are emphatically not OS processes—under the hood, they’re just a data structure in the Erlang/Elixir VM (called the BEAM). One of the core ideas of Elixir processes is that they’re expendable: a crash in one process doesn’t affect other processes, unless those processes explicitly depend on the crashing one. So, consider a process tree structured like this (apologies for my ASCII art):
            UDP Server                    ______________________
           /    |    \                   | Spatial Data Store 1 |
          /     |     \                   ----------------------
         /      |      \                  ______________________
        /       |       \                | Spatial Data Store 2 |
    Client 1  Client 2 ... Client n       ----------------------
                                                   ...
                                          ______________________
                                         | Spatial Data Store n |
                                          ----------------------
A crash in the Client 1 process affects only that client’s connection—not Client 2, nor the UDP server itself. Likewise, a crash in Data Store 1 (in our case, we partition the data in memory based on each plane’s spatial location) doesn’t affect the data in any other data store.
(Of course, a crash in the base UDP Server would still destroy all client connections—there’s no getting around that, so we try to minimize the work the UDP server itself does.)
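A tree like the one above maps directly onto an Elixir `Supervisor`. Here’s a minimal sketch of the idea—module names are illustrative, not X-Plane’s actual code, and the real server would have the UDP listener and the spatial data stores as children too:

```elixir
defmodule MMO.ClientConnection do
  # Each connected client gets its own lightweight BEAM process.
  # A crash here is isolated: with the :one_for_one strategy below, the
  # supervisor restarts only the crashed child, leaving siblings alone.
  use GenServer

  def start_link(client_id) do
    GenServer.start_link(__MODULE__, client_id)
  end

  @impl true
  def init(client_id), do: {:ok, %{client_id: client_id}}
end

defmodule MMO.Supervisor do
  use Supervisor

  def start_link(_opts) do
    Supervisor.start_link(__MODULE__, :ok, name: __MODULE__)
  end

  @impl true
  def init(:ok) do
    # One client child shown for brevity; in practice children would be
    # started dynamically as clients connect.
    children = [
      %{id: :client_1, start: {MMO.ClientConnection, :start_link, [1]}}
    ]

    # :one_for_one — a crashing child is restarted without touching siblings.
    Supervisor.init(children, strategy: :one_for_one)
  end
end
```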
This makes an Elixir server extremely fault tolerant. And that’s paid off for us in practice. In the last 30 days, we’ve had ~2,000 crashes in client connections (usually because of garbled UDP packets)—each of these required the client to reconnect behind the scenes, but it didn’t affect any other clients. To date, we’ve never had a crash high enough in the process tree to disconnect multiple clients or lose data, and I don’t really expect we will.
The other thing this process architecture makes possible is fair scheduling of clients against each other: if you have 10,000 clients, and one of them for whatever reason takes 10 seconds to process an update, that client won’t be allowed to bogart a hardware thread: the BEAM’s scheduler preempts it after a small, fixed budget of work (counted in “reductions,” roughly function calls) so that other processes get their turn. That makes it a lot easier for us to keep response times consistent even in the face of unexpected issues in the wild.
The result of all this is that we can support thousands of clients on a single off-the-shelf cloud VM instance, with great reliability. Developer productivity has never been better, either—I went into this knowing zero Elixir, and by the time I had worked through the official Getting Started tutorial, I felt confident enough in the language to dive in.
Surprises with Elixir: The Bad
This wouldn’t be an honest post-mortem if I didn’t talk about the ways in which Elixir didn’t live up to its hype.
First, despite all the tools the Elixir ecosystem has to support multi-node distributed systems (i.e., a cluster of servers), this is never going to be easy if you need synchronization between them. I’m not aware of any platform that does it better, but it is something the Elixir community kind of oversells. Everybody wants to talk about multi-node clusters, but the reality is that, at least at our scale, we didn’t actually need multi-node support, and it would have been utterly foolish to pay the very high dev costs to build it in from the start. If we ever need to support orders of magnitude more concurrent pilots, we’ll do so by moving to a bare-metal, 64-core machine or something… not by spinning up dozens of 4-core VMs.
The same goes for zero downtime deployments. Again, the community loves to talk about this, and it is indeed really cool that the BEAM supports it. (I don’t know of any other platform where this is possible!) But just like multi-node clusters, there’s a very high dev cost to making this work, and you probably don’t need it. In our case, we’re just doing blue/green deploys: we migrate client traffic from the old server to the new one when you start your next flight.
The biggest shortcoming we encountered in practice was with Elixir’s package ecosystem. To be fair to Elixir, it’s actually far broader than reading comments on the internet had led me to believe, but it still doesn’t hold a candle to NPM or pip (for better and for worse). This meant I had to implement the UDP protocol we use for game state sync (RakNet) from scratch[2]. That was time consuming, but not too terrible. (Of course, I come from the C++ world, where “implement it from scratch” is the default!)
The last pain point I had was with IDE integration. As somebody who uses JetBrains exclusively for all my development work (C++, Objective-C, Android, Python, PHP, Node.js, etc.), it pains me that there’s no first-party Elixir IDE. The community intellij-elixir plugin is really good for a community plugin, but it won’t make you think Elixir is natively supported. Booting up the debugger can take literally minutes on our project, so the debugger is effectively useless; I rely on test harnesses and log-based debugging almost exclusively.
Surprises with Elixir: The Good
There were a few really amazing things I encountered in working with Elixir that I didn’t really expect from just reading about it on the internet.
- Elixir’s support for integrating with Python, C, and other languages that can talk to C wound up being really valuable. It’s great for leveraging libraries written in other languages, though it’s not suitable for our real-time update path because of the inherent cost of marshalling data between the two languages. It let me use a METAR parser written in Python from within Elixir, without having to do painful things like shell out to a system process, ask the Python script to write to disk, then read from disk.
- Pattern matching (and more broadly, the general principle in functional languages of working on the “shape” of the data rather than explicit strong types) is intoxicating. It just leads to such simple, straightforward code! This is one of those things that once you experience it, you start conceiving of all programming problems in these terms, and it’s hard to go back to a language without it.
- It’s so nice to have an all-Elixir stack. I’ve written web apps in other languages (PHP, Node, a bit of Ruby), so I was very used to depending on external technologies for core functionality—HTTP servers, caching layers, Cron jobs, etc. This slide from Saša Jurić’s outstanding talk The Soul of Erlang and Elixir really sums up how Elixir can serve as a web service unto itself. Now, to be clear, is Elixir’s version of these tools as fully featured as the alternative, standalone version? Probably not. But for X-Plane’s use case, we’ve not found any shortcomings, and without being an expert in Redis/Cron/PM2/whatever, I couldn’t actually tell you what Elixir’s version of this stuff is lacking. And that’s the point, really—instead of needing expertise in a bunch of different tools, you can learn one (i.e., Elixir) really well.
Want to Get Started with Elixir?
If all this is intriguing enough to make you want to dive into Elixir, I can recommend a few resources:
- The best place to start is the Saša Jurić talk linked above. This gives an overview of the philosophy of Elixir (and Erlang, which it’s built on). It’s a great introduction to the high-level concepts you’ll build everything else on top of.
- Next, go through the official Getting Started tutorial. I’ve never seen a language’s first-party documentation as good as Elixir’s. You could honestly read this alone and have enough knowledge to write production services.
- Saša Jurić’s book Elixir in Action. I’ve read most of this for a deeper dive into the language, and while it’s not necessarily required reading beyond the official tutorial, I did find it valuable.
Thoughts, questions, comments? You can drop them in the comments below, or hit me up on Twitter.
[1] Long term, we might actually want to split the world’s traffic into multiple servers (e.g., one for Europe, one for the Americas, etc.), since no amount of technical tricks can eliminate the latency of sending a packet from, say, Sydney to New York. For the initial release, though, we could deal with the latency, and we wanted the option of hosting tens of thousands of players on a single server.

[2] We recently open sourced the RakNet protocol implementation—you can find it in the X-Plane GitHub. The README gives a good overview of the full MMO server’s architecture, too: each client connection is a stateful Elixir process, acting asynchronously on a client state struct; clients asynchronously schedule themselves to send updates back to the user.