25 Sept 2012

Nebula3/NaCl "Post Mortem"



Ok, it's not really a post-mortem since I consider the work on the Native Client port far from finished. For my current "weekend stuff" NaCl actually provides a pretty cool distribution channel for simple tech demos, so this gives me enough motivation to keep the port fresh and up to date.

The Chrome Web Store lets me publish those small tech demos in an extremely simple way: just zip a directory and upload it to the web store dashboard, without any review or certification process getting in the way, the demos automatically work on Windows, OSX and Linux, and it's extremely easy for the user to install and run the demo (just a few seconds and 2 button clicks from start to launch).

Ok, let's go:

Build System and General Workflow

Native Client offers a standard GCC-cross-compiling toolchain and is using regular Makefiles (they switched away from scons some time ago). Nebula3 is using CMake as meta-build-system. CMake generates platform- and IDE-specific low-level-build files from a set of generic project description files.

For Native Client I wrote a very simple CMake toolchain file which points the build process to the right GCC tools and directories. This is the same general procedure I already used before for the Android and Alchemy platforms (both were just relatively short-lived experiments, so don't hold your breath). CMake generates Makefiles and an Eclipse wrapper project from the NaCl specific toolchain file and the generic project description files.

I'm currently doing most of my spare-time programming work on OSX, and the most practical option there was to generate Eclipse projects. I didn't manage to automatically generate an XCode project with cross-compiling support through CMake, only by manually creating an XCode project with an "external build system target" which points to the CMake-generated Makefiles.

A special feature of working with NaCl is the 32-bit vs. 64-bit support. NaCl projects are *required* to offer both a 32-bit and a 64-bit executable. I simply solved this by generating 2 separate sets of build files from two different CMake toolchain files. 

Working with the NaCl projects still isn't as fluid as working on a native OSX project in XCode or a Windows project on VisualStudio. There's no IDE debugger, compile times are much longer (especially compared to XCode), and the edit-compile-test-loop is more complicated because the NaCl executable can only be run inside Chrome.

That's why I'm still doing most of the general coding and debugging on a regular OSX application project inside XCode, and use the NaCl build system as a pure cross-compiling solution with command-line make. Only when there are bugs in the NaCl-specific code, side effects which only happen on the NaCl platform, or new NaCl functionality to implement do I spend any significant time inside Eclipse.

NaCl will support a CPU-agnostic binary format called PNaCl in the (hopefully not too distant) future. The build process will then generate some sort of LLVM bitcode, and finish the compilation process on the client side for a specific CPU instruction set. This will probably speed up the build process and get rid of the several build passes (and hopefully won't increase the startup time in the browser).

Also there's a VisualStudio addin which seems to be nearing completion. I must admit that I didn't look at this yet. If this solution provides a usable UI debugger I'll be in heaven :)


GLIBC vs newlib

NaCl offers 2 options for the C runtime: the "fat" option is GLIBC, the slim option is newlib. I first went with GLIBC because it sounded like the more complete option and makes porting POSIX code easier. The problem with the GLIBC runtime is that it uses dynamic linking and needs to download a myriad of huge DLL files to the client before the actual client binary can be launched. After I saw the resulting download size just required to start the executable (I think it was near 20 MB?) I was scared and went with newlib.

Now, DLLs *may* be an option in the future if the DLLs could be shared between different NaCl projects, and NaCl becomes really popular (or most of the DLLs would already be included in the standard Chrome installation), but as far as I understand, none of this is true yet.

With newlib you get an all-in-one executable, everything is nicely statically linked, and dead code stripped away. For a simple Nebula3 viewer I ended up with 6.8 MB for the 32-bit nexe, and 7.2 MB for the 64-bit nexe, still quite heavy, but ok. The compressed size for the Chrome Web Store packaged-app ended up around 4 MB.


Porting Fundamentals

You should start with a "POSIX + OpenGL + 64bit-clean" codebase and implement the NaCl specific stuff from there. If you're coming from "Win32 + DirectX" this is quite a bummer of course. Thankfully I had all basic requirements already in place. For the Drakensang Online server we had ported the Nebula3 Foundation Layer to 64-bit-Linux, and for my various other porting experiments (iOS, OSX, Android, etc...) I had the (old) Nebula3 render pipeline already ported to OpenGL. 


NaCl Restrictions

Ok, this is where it gets interesting. NaCl has a very "interesting" application model. If you're familiar with co-operative multitasking (like back in the days of Windows 3.1 or the Wii) you'll feel right at home :D

The problem seems to be that Chrome can't give you complete control of your game loop for security reasons. Each Chrome tab is running in its own process, but a NaCl application is sharing this process' main thread (called the "Pepper Thread") with other tasks which need to run on a web page (like the Javascript engine, the page layout engine, the HTTP communication code, etc...). The details are not so important and may change between Chrome versions. The important info is that a NaCl application can completely block a Chrome tab or lead to unresponsive behaviour if one "NaCl frame" takes too long.

So that's the first problem: you're only allowed to spend very little time on this "Pepper Thread" and then need to give control back to Chrome.

And then there's the famous "Call-On-Main-Thread" restriction: most of the "system level stuff" needs to be called from the main-thread. If you want to download a file: main-thread. If you want to call an OpenGL function: main-thread. Getting input events: main-thread. And so on. Fortunately this restriction doesn't apply to the C runtime functions like malloc().

So problem #2 is: "Call-On-Main-Thread-Restriction".

Problem #3: Tasks can be scheduled to execute on the Pepper thread, but they are processed through a simple worker queue... which means: you can start an asynchronous task to execute on the Pepper Thread from *another* thread and then block that thread until the execution of the task has finished to simulate a blocking operation. But you can't do this on the Pepper Thread! Since the asynchronous task will only execute after you have given control back to the Pepper Thread, you just shut down the whole execution pipeline by waiting for an event which will never trigger.

All of these 3 problems combined result in a very nasty restriction for "real world code": you can't simply do an fopen() followed by an fread(), or a bit more general: you can't do any blocking operation from the Pepper Thread.
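
To make the "simulated blocking" trick and its deadlock trap a bit more concrete, here is a minimal sketch in plain C++11. It does not use the real Pepper API: MainThreadQueue, BlockingRead and RunDemo are hypothetical names, the queue only stands in for the Pepper Thread's task queue, and the "file content" is a made-up string.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <utility>

// Hypothetical stand-in for the Pepper Thread's task queue (NOT the Pepper API).
class MainThreadQueue {
public:
    // The constructing thread plays the role of the Pepper Thread.
    MainThreadQueue() : owner_(std::this_thread::get_id()) {}

    bool IsMainThread() const { return std::this_thread::get_id() == owner_; }

    // May be called from any thread: enqueue a task for the main thread.
    void Post(std::function<void()> task) {
        std::lock_guard<std::mutex> lock(mutex_);
        tasks_.push_back(std::move(task));
    }

    // Main thread only: run what is queued right now, then return,
    // which models "giving control back to Chrome".
    std::size_t RunPending() {
        std::deque<std::function<void()>> batch;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            batch.swap(tasks_);
        }
        for (auto& task : batch) task();
        return batch.size();
    }

private:
    std::thread::id owner_;
    std::mutex mutex_;
    std::deque<std::function<void()>> tasks_;
};

// Simulated blocking read: post the work to the main thread, then wait.
// Calling this ON the main thread would wait for a task that can never run.
std::string BlockingRead(MainThreadQueue& queue, const std::string& path) {
    assert(!queue.IsMainThread() && "would deadlock on the main thread");
    auto promise = std::make_shared<std::promise<std::string>>();
    std::future<std::string> result = promise->get_future();
    queue.Post([promise, path] { promise->set_value("contents of " + path); });
    return result.get();  // blocks the calling (non-main) thread
}

// Worker thread blocks, main thread keeps pumping its queue.
std::string RunDemo() {
    MainThreadQueue queue;
    std::string loaded;
    std::thread worker([&] { loaded = BlockingRead(queue, "shaders.bin"); });
    while (queue.RunPending() == 0) std::this_thread::yield();
    worker.join();
    return loaded;
}
```

The assert in BlockingRead is the whole point: the pattern is fine on any other thread, but on the thread that owns the queue it can only hang.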

Unfortunately, the first thing the Nebula3 render thread does, is to load shader files in a synchronous manner, and it can't really continue without the shaders being setup. Ok, no problem, that's why it's a render *thread*, we can simulate blocking i/o if we're not on the Pepper thread, right? But wait a minute, to issue OpenGL calls we need to run on the Pepper thread...

And that's basically the core problem... I'll get to the solution later under "Rendering".


Application Structure

A Nebula3 desktop application on OSX has 3 "important" threads:
  • The "process main thread": (let's call it the event thread), this creates the app window (an OSX-specific restriction), spawns the "game loop thread" and then goes into an event loop, waiting for and occasionally waking up on operating system events
  • The "game thread": this is where the game loop is running, the important part here is that it runs decoupled from the event thread. While the event thread is blocking, waiting for events, the game thread is running uninterrupted in a loop
  • The "render thread": with the new experimental "Twiggy" render pipeline this is a very simple thread behind a push-buffer which receives and executes high-level rendering commands from the main thread, the render thread will starve (currently) if the main thread doesn't issue rendering commands.
A NaCl application looks very similar, but doesn't have a dedicated render thread, instead all the render-thread code is actually running on the Pepper Thread:
  • The "Pepper Thread": This is the starting thread. There isn't a traditional event loop, instead there's a one-time Init method and a couple of callback methods which NaCl calls for various events (input, focus change, etc...). Also, this is where NaCl callback tasks from the other Nebula3 code are executed.
  • The "game thread": this works exactly like in a normal desktop application, except that it *doesn't* spawn a render thread, instead it will enqueue a callback to the Pepper thread once a whole frame of rendering commands has been written to the push-buffer.
  • The "render thread" is actually running as a callback on the Pepper Thread. But the actual code is identical to other platforms: highlevel rendering commands are pulled from the push buffer and translated to low-level OpenGL calls.
The key idea here is that the game's main thread isn't identical with the process main thread. This is actually a pretty cool pattern for any event-driven platform (on Windows you get around this problem because the game loop may control the Windows event processing through PeekMessage()/DispatchMessage()). On other platforms like OSX, iOS or Android this isn't so simple though, and the above pattern is a very elegant way to work around the "event-loop vs. game-loop" problem (IMHO).


Loading Data

NaCl offers 2 ways to get data into the engine. One is a wrapper class for loading data from an HTTP server, the other is read/write access to a local filesystem sandbox (basically a wrapper around HTML5's local storage system). A real-world application should use both: HTTP requests to get the data from a web server, and the local storage system for caching to become independent from the browser cache.

Nebula3 has a pretty nifty virtual file system layer which associates URL schemes (like http:, file:, ...) with pluggable file systems, and we wrote a very sophisticated "HTTP filesystem" for Drakensang Online. Apart from that, the Nebula3 IO system is multithreaded. There's a single dispatcher thread which consumes IO requests and schedules them to a number of IO threads which process the requests. There can be as many IO requests "in flight" as there are IO threads, which is important for data streaming over slow and unreliable HTTP connections.

The end result is that we can use the exact same code to load data from the local disc or from web servers (in fact there's only one line of code required to switch the "root location").
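
Here's a rough sketch of the "URL scheme to pluggable file system" idea, including the one-line root switch. This is not Nebula3's actual IO code: FileSystem, LocalStub, HttpStub, VirtualFileSystem and ReadWithRoot are names I made up for the illustration, the stubs only return a tagged string instead of doing real IO, and the dispatcher and IO threads are left out completely.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical interface for a pluggable file system.
struct FileSystem {
    virtual ~FileSystem() {}
    virtual std::string Read(const std::string& url) const = 0;
};

// Stubs: a real implementation would read from disc or talk to a web server.
struct LocalStub : FileSystem {
    std::string Read(const std::string& url) const override { return "[file] " + url; }
};
struct HttpStub : FileSystem {
    std::string Read(const std::string& url) const override { return "[http] " + url; }
};

class VirtualFileSystem {
public:
    // Associate a URL scheme ("file", "http", ...) with a file system.
    void Register(const std::string& scheme, std::unique_ptr<FileSystem> fs) {
        assert(fs && "file system must not be null");
        schemes_[scheme] = std::move(fs);
    }

    // The "one line of code" which switches where all data comes from.
    void SetRoot(const std::string& rootUrl) { root_ = rootUrl; }

    // Resolve a root-relative path and dispatch on the URL scheme.
    std::string Read(const std::string& relativePath) const {
        const std::string url = root_ + relativePath;
        const std::string::size_type colon = url.find(':');
        if (colon == std::string::npos) {
            throw std::invalid_argument("URL has no scheme: " + url);
        }
        const auto it = schemes_.find(url.substr(0, colon));
        if (it == schemes_.end()) {
            throw std::invalid_argument("no file system registered for: " + url);
        }
        return it->second->Read(url);
    }

private:
    std::string root_;
    std::map<std::string, std::unique_ptr<FileSystem>> schemes_;
};

// Same loading code, different root location.
std::string ReadWithRoot(const std::string& rootUrl) {
    VirtualFileSystem vfs;
    vfs.Register("file", std::unique_ptr<FileSystem>(new LocalStub));
    vfs.Register("http", std::unique_ptr<FileSystem>(new HttpStub));
    vfs.SetRoot(rootUrl);
    return vfs.Read("textures/wall.dds");
}
```

The calling code only ever sees root-relative paths, so it doesn't care whether the root is a local directory or a web server.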

For the current NaCl port, I only implemented an extremely simple HTTP virtual filesystem using the NaCl URLLoader class without local caching which allows me to pull files from a web server, and which can also simulate blocking IO as long as we're not on the Pepper thread.


Handling Input

This is a no-brainer. The Pepper Thread gets called when input events are available for mouse and keyboard. These get converted to "Nebula3 events" and pushed onto a thread-safe queue. There's a design fault in my current implementation: the render thread pulls and handles these events (this is still legacy code from the old render pipeline). Don't do this, because in NaCl the render-thread code executes on the Pepper thread. What *should* happen is that the "game loop thread" pulls and processes these events.
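
A thread-safe event queue of this kind can be sketched in a few lines of standard C++11. InputEvent, InputEventQueue and CountDeliveredEvents are hypothetical names for this example, not Nebula3 classes, and a plain std::thread plays the role of the Pepper Thread's input callback.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical, simplified engine-side input event.
struct InputEvent {
    enum Type { KeyDown, KeyUp, MouseMove };
    Type type;
    int x;
    int y;
};

class InputEventQueue {
public:
    // Producer side (the input callback): cheap, never waits for the consumer.
    void Push(const InputEvent& event) {
        std::lock_guard<std::mutex> lock(mutex_);
        events_.push_back(event);
    }

    // Consumer side (the game thread): grab everything queued so far.
    std::vector<InputEvent> DrainAll() {
        std::vector<InputEvent> drained;
        std::lock_guard<std::mutex> lock(mutex_);
        drained.swap(events_);
        return drained;
    }

private:
    std::mutex mutex_;
    std::vector<InputEvent> events_;
};

// Returns the number of events received, or -1 if they arrived out of order.
int CountDeliveredEvents(int eventCount) {
    assert(eventCount >= 0);
    InputEventQueue queue;

    // Stand-in for the Pepper Thread delivering input events.
    std::thread producer([&] {
        for (int i = 0; i < eventCount; ++i) {
            queue.Push(InputEvent{InputEvent::MouseMove, i, i});
        }
    });

    // Stand-in for the game loop: drain once per "frame".
    int received = 0;
    bool inOrder = true;
    while (received < eventCount) {
        for (const InputEvent& event : queue.DrainAll()) {
            if (event.x != received) inOrder = false;
            ++received;
        }
        std::this_thread::yield();
    }
    producer.join();
    return inOrder ? received : -1;
}
```

Draining the whole queue in one swap keeps the lock held only briefly, so the producer side is never stalled for long by the consumer.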


Rendering

This is an interesting area again because OpenGL calls must happen from the Pepper thread, and this is the area where I spent the most time, and eventually this restriction was an important contributing factor to the new "Twiggy" render pipeline.

Traditionally, Nebula3 uses a "fat render thread". The render thread runs a completely autonomous graphics world, and only gets updates like position changes from the main thread. It works, but has led to several problems over time which I won't get into here...

When I first dabbled around with NaCl I kept this design and wrote an "OpenGL stream codec". Whenever the original code would issue an OpenGL call, on NaCl this call would be encoded into a binary stream and when the render-thread frame has finished, a Pepper Thread callback would be issued which decodes the command and executes the OpenGL call.

This was basically the right idea but had 2 very serious downsides:
  1. excessive granularity: a real-world game might have thousands of OpenGL calls per frame, and the overhead this would produce is just silly (especially since NaCl has to do the same encode/decode *again*).
  2. excessive syncing: Every OpenGL call which produces a result needs to block until the decoder thread has issued the OpenGL call. This is a problem with resource creation and uniform location lookups.
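
For illustration, such a stream codec might look roughly like the sketch below. This is not the original Nebula3 code: Opcode, CommandStream, Decode and RoundTrip are hypothetical names, only two GL calls are covered, and the decoder writes a text log instead of calling into real OpenGL so that the example runs anywhere.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical opcodes, one per wrapped OpenGL function.
enum Opcode : std::uint8_t { OpBindTexture = 1, OpDrawArrays = 2 };

// Encoder: used by the render thread instead of calling OpenGL directly.
class CommandStream {
public:
    void BindTexture(std::uint32_t target, std::uint32_t texture) {
        Put(OpBindTexture); Put(target); Put(texture);
    }
    void DrawArrays(std::uint32_t mode, std::int32_t first, std::int32_t count) {
        Put(OpDrawArrays); Put(mode); Put(first); Put(count);
    }
    const std::vector<std::uint8_t>& Bytes() const { return bytes_; }

private:
    // Append the raw bytes of a value to the stream.
    template <typename T> void Put(T value) {
        const std::uint8_t* p = reinterpret_cast<const std::uint8_t*>(&value);
        bytes_.insert(bytes_.end(), p, p + sizeof(T));
    }
    std::vector<std::uint8_t> bytes_;
};

// Read one value back and advance the read position.
template <typename T>
T Get(const std::vector<std::uint8_t>& bytes, std::size_t& pos) {
    assert(pos + sizeof(T) <= bytes.size() && "truncated command stream");
    T value;
    std::memcpy(&value, bytes.data() + pos, sizeof(T));
    pos += sizeof(T);
    return value;
}

// Decoder: would run as a callback on the Pepper Thread and issue the
// real GL calls. Here it only records what it would have called.
std::vector<std::string> Decode(const std::vector<std::uint8_t>& bytes) {
    std::vector<std::string> log;
    std::size_t pos = 0;
    while (pos < bytes.size()) {
        switch (Get<std::uint8_t>(bytes, pos)) {
        case OpBindTexture: {
            const std::uint32_t target = Get<std::uint32_t>(bytes, pos);
            const std::uint32_t texture = Get<std::uint32_t>(bytes, pos);
            log.push_back("glBindTexture(" + std::to_string(target) + ", " +
                          std::to_string(texture) + ")");
            break;
        }
        case OpDrawArrays: {
            const std::uint32_t mode = Get<std::uint32_t>(bytes, pos);
            const std::int32_t first = Get<std::int32_t>(bytes, pos);
            const std::int32_t count = Get<std::int32_t>(bytes, pos);
            log.push_back("glDrawArrays(" + std::to_string(mode) + ", " +
                          std::to_string(first) + ", " + std::to_string(count) + ")");
            break;
        }
        default:
            throw std::runtime_error("unknown opcode in command stream");
        }
    }
    return log;
}

// Encode two calls on one side, decode them on the other.
std::vector<std::string> RoundTrip() {
    CommandStream stream;
    stream.BindTexture(3553, 7);   // 3553 == GL_TEXTURE_2D
    stream.DrawArrays(4, 0, 36);   // 4 == GL_TRIANGLES
    return Decode(stream.Bytes());
}
```

Both downsides are visible here: every single GL call becomes one encode plus one decode, and a call with a return value has no way to get its result back without waiting for the decoder to run.
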
I recently wrote a new render pipeline which gets rid of the "fat render thread", instead the new render thread just sits behind a push-buffer, and consumes and executes rendering commands. This is the better design because:
  • command granularity is a lot lower since the command protocol is higher level than OpenGL
  • the main thread *never* blocks mid-frame for resource creation, only at most once per frame if no free push buffer is currently available (because the main thread has run too far ahead of the render thread)
For NaCl this is the perfect design: all OpenGL calls happen on the Pepper thread, and the Pepper thread doesn't do any heavy processing; just some simple translation from Nebula3's high-level rendering protocol to low-level OpenGL calls.
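
The general shape of such a push-buffer can be sketched like this. The sketch makes a few assumptions: Command, Frame, PushBufferQueue and RunFrames are hypothetical names, the two command types are placeholders rather than Nebula3's real rendering protocol, and the "Pepper thread" is simply the thread which calls RunFrames.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical high-level rendering command (much coarser than a GL call).
struct Command {
    enum Type { SetCamera, DrawMesh };
    Type type;
    int arg;
};
typedef std::vector<Command> Frame;  // one frame's worth of commands

class PushBufferQueue {
public:
    explicit PushBufferQueue(std::size_t maxInFlight) : maxInFlight_(maxInFlight) {
        assert(maxInFlight > 0);
    }

    // Game thread: hand over a complete frame. Blocks only if the game
    // thread has run too far ahead of the render side.
    void Submit(Frame frame) {
        std::unique_lock<std::mutex> lock(mutex_);
        notFull_.wait(lock, [this] { return frames_.size() < maxInFlight_; });
        frames_.push_back(std::move(frame));
    }

    // Render side: never blocks, so it is safe to call from a thread
    // which must return quickly. Returns false if no frame is ready.
    bool TryConsume(Frame& out) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (frames_.empty()) return false;
            out = std::move(frames_.front());
            frames_.pop_front();
        }
        notFull_.notify_one();
        return true;
    }

private:
    const std::size_t maxInFlight_;
    std::mutex mutex_;
    std::condition_variable notFull_;
    std::deque<Frame> frames_;
};

// Returns how many DrawMesh commands the render side executed.
int RunFrames(int frameCount) {
    PushBufferQueue queue(2);
    std::thread game([&] {
        for (int i = 0; i < frameCount; ++i) {
            Frame frame;
            frame.push_back(Command{Command::SetCamera, i});
            frame.push_back(Command{Command::DrawMesh, i});
            queue.Submit(std::move(frame));
        }
    });

    int consumed = 0;
    int drawCalls = 0;
    Frame frame;
    while (consumed < frameCount) {
        if (queue.TryConsume(frame)) {
            ++consumed;
            for (const Command& cmd : frame) {
                if (cmd.type == Command::DrawMesh) ++drawCalls;  // GL calls go here
            }
        } else {
            std::this_thread::yield();  // models returning control to Chrome
        }
    }
    game.join();
    return drawCalls;
}
```

The asymmetry is deliberate: the producer may wait, at most once per frame, while the consumer only ever polls and returns.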

Oh, I forgot: remember when I said that the first thing the render thread does is to synchronously load shader files? I got around this problem in a radical way by compiling the shader code into the program. A little command line tool reads the shader files which are otherwise loaded as data files and converts them into a big monolithic C++ source file which sets up the complete shader "database". All other file accesses in the render thread had already been asynchronous, thankfully.
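
The core of such a converter is small. The following is only a sketch under my own assumptions, not the actual tool: EscapeForCpp and GenerateShaderTable are hypothetical names, the generated ShaderEntry/kShaders layout is made up, and the shader sources are passed in as a map instead of being read from files.

```cpp
#include <cassert>
#include <map>
#include <string>

// Escape text so that it can be pasted between double quotes in C++ source.
std::string EscapeForCpp(const std::string& text) {
    std::string out;
    for (char c : text) {
        switch (c) {
        case '\\': out += "\\\\"; break;
        case '"':  out += "\\\""; break;
        case '\n': out += "\\n";  break;
        case '\r': out += "\\r";  break;
        case '\t': out += "\\t";  break;
        default:   out += c;      break;
        }
    }
    return out;
}

// Turn (file name -> shader source) pairs into one C++ source file which
// defines a static lookup table. The layout of the table is an assumption.
std::string GenerateShaderTable(const std::map<std::string, std::string>& shaders) {
    assert(!shaders.empty() && "an empty array initializer would not compile");
    std::string src;
    src += "// generated file, do not edit\n";
    src += "struct ShaderEntry { const char* name; const char* source; };\n";
    src += "static const ShaderEntry kShaders[] = {\n";
    for (const auto& entry : shaders) {
        src += "    { \"" + EscapeForCpp(entry.first) + "\", \"" +
               EscapeForCpp(entry.second) + "\" },\n";
    }
    src += "};\n";
    return src;
}
```

At runtime the render code can then look up a shader by name in the generated table instead of opening a file, so no IO happens at all during shader setup.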

From Here Onward

The only big show-stopping must-have feature that NaCl doesn't provide yet (IMHO) is UDP sockets, or any other type of "lossy but fast" type of communication channel.

I don't care that much about the call-on-main-thread restriction any more. I know there's a solution on the way, but it is only a viable option if the overhead is smaller than the current solution.

A proper sockets solution on the other hand is critical for action-oriented online games. Maybe I'm old-fashioned, and TCP sockets are "good enough" these days, but a guaranteed messaging layer is still overkill for fast-paced action games. It's just not important whether movement commands arrive, but it is a catastrophe if communication stalls *if* packets are lost.

I looked into WebSockets, and I don't have a lot of praise for them. I think I have a basic understanding of the security implications, but I will never understand why WebSockets work the way they do. It looks like they were only designed to make AJAX communication suck less, but not with the type of network communication in mind that any type of multiplayer game requires (well, maybe turn-based games like Chess or Civilization).

So if I had one free wish for the NaCl team: give us proper sockets, with TCP and UDP support, and create a security model around them, and don't care about HTML5. Ideas: only allow outgoing connections (the "client" in a client-server model), restrict where a socket may connect to (something like crossdomain.xml), etc... I'm no security expert, but I think / hope that it should be possible to create a secure networking system under the standard socket API (or a subset thereof).

Phew, that's it for now. Over and out :)