20 Apr 2014

cmake and the Android NDK

TL;DR: how to build Android NDK applications with cmake instead of the custom NDK build system; this is useful for projects which already use cmake to create multiplatform/cross-compiling build files.

Update: Thanks to thp for pointing out a rather serious bug: packaging the standard shared libraries into the APK should NOT be necessary since these are pre-installed on the device. I noticed that I didn’t set a library search path to the toolchain lib dir in the linker step (-L…) which might explain the crash I had earlier, but unfortunately I can’t reproduce this crash anymore with the old behaviour (no library search path and no shared system libraries in the APK). I’ll keep an eye on that and update the blog post with my findings.


I’ve spent the last 2.5 days adding Android support to Oryol’s build system. This wasn’t exactly on my to-do list until I sorta “impulse-bought” a Nexus7 tablet last Thursday. It basically went like this “hey that looks quite neat for a non-iPad tablet => wow, scrolling feels smooth, very non-Android-like => holy shit it runs my Oryol WebGL samples at 60fps => hmm 179 Euros seems quite reasonable…” - I must say I’m impressed by how far the Android “user experience” has come since I last dabbled with it. The UI finally feels completely smooth, and I didn’t have any of those Windows8-Metro-style WTF-moments yet.

Ok, so the logical next step would be to add support for Android to the Oryol build system (if you don’t know what Oryol is: it’s a new experimental C++11 multi-plat engine I started a couple months ago: https://github.com/floooh/oryol).

The Oryol build system is cmake-based, with a python script on top which simplifies managing the dozens of possible build-configs. A build-config is one specific combination of target-platform (osx, ios, win32, win64, …), build-tools (make, ninja, Visual Studio, Xcode, …) and compile-mode (Release, Debug) stored under a descriptive name (e.g. osx-xcode-debug, win32-vstudio-release, emscripten-make-debug, …).

The front-end python script called ‘oryol’ is used to juggle all the build-configs, invoke cmake with the right options, and perform command line builds.

One can for instance simply call:

> ./oryol update osx-xcode-debug

…to generate an Xcode project.

Or to perform a command line build with xcodebuild instead:

> ./oryol build osx-xcode-debug

Or to build Oryol for emscripten with make in Release mode (provided the emscripten SDK has been installed):

> ./oryol build emscripten-make-release

This also works on Windows (32- or 64-bit):

> oryol build win64-vstudio-debug
> oryol build win32-vstudio-debug

…or on Linux:

> ./oryol build linux-make-debug

Now, what I want to do with my shiny new Nexus7 is of course this:

> ./oryol build android-make-debug

This turned out to be harder than usual. But let’s start at the beginning:

A cross-compiling scenario is normally well defined in the GCC/cmake world:

A toolchain wraps the target-platform’s compiler tools, system headers and libs under a standardized directory structure:

The compiler tools usually reside in a bin subdirectory and are called gcc and g++ (or in the LLVM world: clang and clang++); sometimes the tools have a prefix (pnacl-clang and pnacl-clang++), or completely different names (like emcc in the emscripten SDK).

Headers and libs are often located in a usr directory (usr/include and usr/lib).

The toolchain headers contain at least the C-Runtime headers, like stdlib.h, stdio.h, usually the C++ headers (vector, iostream, …), and often also the OpenGL headers and other platform-specific header files.

Finally the lib directory contains precompiled system libraries for the target platform (for instance libc.a, libc++.a, etc…).

With such a standard gcc-style toolchain, cross-compilation is very simple. Just make sure that the toolchain-compiler tools are called instead of the host platform’s tools, and that the toolchain headers and libs are used.

cmake standardizes this process with its so-called toolchain-files. A toolchain-file defines what compiler tools, headers and libraries should be used instead of the ‘default’ ones, and usually also overrides compile and linker flags.
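
To make this concrete, here is a minimal sketch of what such a toolchain file can look like (the paths are placeholders, not taken from any actual SDK):

# minimal sketch of a gcc-style cross-compiling toolchain file (placeholder paths)
set(CMAKE_SYSTEM_NAME Linux)
# use the toolchain's compiler tools instead of the host compilers
set(CMAKE_C_COMPILER "/path/to/toolchain/bin/gcc")
set(CMAKE_CXX_COMPILER "/path/to/toolchain/bin/g++")
# search headers and libs only inside the toolchain
set(CMAKE_FIND_ROOT_PATH "/path/to/toolchain/usr")
set(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)
set(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
set(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)
# override compile flags as needed
set(CMAKE_CXX_FLAGS "-ffast-math" CACHE STRING "")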

The typical strategy when adding a new target platform to a cmake build system looks like this:

  • set up the target platform’s SDK
  • create a new toolchain file (obviously)
  • tell cmake where to find the compiler tools, headers and libs
  • add the right compile and linker flags

Once the toolchain file has been created, call cmake with the toolchain file:

> cmake -G"Unix Makefiles" -DCMAKE_TOOLCHAIN_FILE=[path-to-toolchain-file] [path-to-project]

Then run make in verbose mode to check whether the right compiler is called, and with the right options:

> make VERBOSE=1

This approach works well for platforms like emscripten or Google Native Client. Some platforms require a bit of additional cmake-magic: a Portable Native Client executable, for instance, must be “finalized” after it has been linked. Additional build steps like these can be added easily in cmake with the add_custom_command macro.
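
Such a finalize step can look roughly like this (a sketch; PNACL_TOOLCHAIN_BIN is an assumed variable pointing into the NaCl SDK, not an official name):

# sketch: run pnacl-finalize on the linked executable as a post-build step
add_custom_command(TARGET ${target} POST_BUILD
    COMMAND ${PNACL_TOOLCHAIN_BIN}/pnacl-finalize $<TARGET_FILE:${target}>)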

Integrating Android as a new target platform isn’t so easy though:

  • the Android SDK itself only allows creating pure Java applications; for C/C++ apps, the separate Android NDK (Native Development Kit) is required
  • the NDK doesn’t produce complete Android applications, it needs the Android Java SDK for this
  • native Android code isn’t a typical executable, but lives in a shared library which is called from Java through JNI
  • the Android SDK and NDK both have their own build systems which hide a lot of complexity
  • …this complexity comes from the combination of different host platforms (OSX, Linux, Windows), target API levels (android-3 to android-19, roughly corresponding to Android versions), compiler versions (gcc4.6, gcc4.8, clang3.3, clang3.4), and finally CPU architectures and instruction sets (ARM, MIPS, X86, with several variations for ARM: armv5, armv7, with or without NEON, etc…)
  • C++ support is still bolted on, the C++ headers and libs are not in their standard locations
  • the NDK doesn’t follow the standard GCC toolchain directory structure at all

The custom build system that comes with the NDK does a good job of hiding all this complexity, for instance it can automatically build for all CPU architectures, but it stops after the native shared library has been compiled: it cannot create a complete Android APK. For this, the Android Java SDK tools must be called from the command line.

So back to how to make this work in cmake:

The plan looks simple enough:

  1. compile our C/C++ code into a shared library instead of an executable
  2. somehow get this into a Java APK package file…
  3. …deploy APK to Android device and run it

Step 1 starts rather innocently: create a toolchain file, look up the paths to the compiler tools, headers and libs in the NDK, then look up the compiler and linker command line args by watching a verbose build. Then put all this stuff into the right cmake variables. At least this is how it usually works. Of course for Android it’s all a bit more complicated:

  • first we need to decide on a target CPU architecture and what compiler to use. I settled for ARM and gcc4.8, which leads us to […]/android-ndk-r9d/toolchains/arm-linux-androideabi-4.8/prebuilt
  • in there is a directory darwin-x86_64 so we need separate paths by host platform here
  • finally in there is a bin directory with the compiler tools, so GCC would be for instance at [..]/android-ndk-r9d/toolchains/arm-linux-androideabi-4.8/prebuilt/darwin-x86_64/bin/arm-linux-androideabi-gcc
  • there are also include, lib and share directories in there, but the stuff inside definitely doesn’t look like system headers and libs… bummer.
  • the system headers and libs are under the platforms directory instead: [..]/android-ndk-r9d/platforms/android-19/arch-arm/usr/include, and [..]/android-ndk-r9d/platforms/android-19/arch-arm/usr/lib
  • so far so good… put this stuff into the toolchain file and it seemed to compile fine – until the first C++ header had to be included - WTF?
  • on closer inspection, the system include directory doesn’t contain any C++ headers, and there are different C++ lib implementations to choose from under [..]/android-ndk-r9d/sources/cxx-stl

This was the point where I was seriously thinking about calling it a day, until I stumbled across the make-standalone-toolchain.sh script in build/tools. This is a helper script which will build a standard GCC-style toolchain for one specific Android API-level and target CPU:

sh make-standalone-toolchain.sh --platform=android-19
  --ndk-dir=/Users/[user]/android-ndk-r9d
  --install-dir=/Users/[user]/android-toolchain
  --toolchain=arm-linux-androideabi-4.8
  --system=darwin-x86_64

This will extract the right tools, headers and libs, and also integrate the C++ headers (by default gnustl, but this can be selected with the --stl option). When the script is done, a new directory ‘android-toolchain’ has been created which follows the GCC toolchain standard and is much easier to integrate with cmake:

The important directories are (see the toolchain-file sketch below):
- [..]/android-toolchain/bin: this is where the compiler tools are located, these are still prefixed though (e.g. arm-linux-androideabi-gcc)
- [..]/android-toolchain/sysroot/usr/include: CRT headers, plus EGL, GLES2, etc…, but NOT the C++ headers
- [..]/android-toolchain/include: the C++ headers are here, under ‘c++’
- [..]/android-toolchain/sysroot/usr/lib: .a and .so system libs, libstdc++.a/.so is also here, no idea why
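
Mapped into the toolchain file, these paths look roughly like this (a sketch; the install dir is the one from the make-standalone-toolchain.sh call above, and the exact C++ include subdirectory may differ):

# sketch: toolchain file entries for the standalone Android toolchain
set(ANDROID_TOOLCHAIN_ROOT "/Users/[user]/android-toolchain")
set(CMAKE_SYSTEM_NAME Linux)
set(CMAKE_C_COMPILER "${ANDROID_TOOLCHAIN_ROOT}/bin/arm-linux-androideabi-gcc")
set(CMAKE_CXX_COMPILER "${ANDROID_TOOLCHAIN_ROOT}/bin/arm-linux-androideabi-g++")
# CRT, EGL, GLES2 headers, plus the C++ headers
include_directories("${ANDROID_TOOLCHAIN_ROOT}/sysroot/usr/include"
                    "${ANDROID_TOOLCHAIN_ROOT}/include/c++/4.8")
# precompiled system libs (libc.so, liblog.so, libstdc++.so, ...)
link_directories("${ANDROID_TOOLCHAIN_ROOT}/sysroot/usr/lib")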

After setting these paths in the toolchain file, and telling cmake to create shared-libs instead of exes when building for the Android platform, the compile and link steps went through: instead of a CoreHello executable, I got a libCoreHello.so. So far so good.

The next step was to figure out how to get this .so into an APK which can be uploaded to an Android device.

The NDK doesn’t help with this, so this is where we need the Java SDK tools, which use yet another build system: ant. From looking at the SDK samples I figured out that it is usually enough to call ant debug or ant release within a sample directory to build an .apk file into a bin subdirectory. ant requires a build.xml file which defines the build tasks to perform. Furthermore, Android apps have an embedded AndroidManifest.xml file which describes how to run the application, and what privileges it requires. None of these exist in the NDK samples directories though…

After some more exploration it became clear: the SDK has a helper script called android which is used (among many other things) to set up a project directory structure with all required files for ant to create a working APK:

> android create project
    --path MyApp
    --target android-19
    --name MyApp
    --package com.oryol.MyApp
    --activity MyActivity

This will set up a directory ‘MyApp’ with a complete Android Java skeleton app. Run ‘ant debug’ in there and it will create a ‘MyApp-debug.apk’ in the ‘bin’ subdirectory, which can be deployed to the Android device with ‘adb install MyApp-debug.apk’. When launched, the app displays a ‘Hello World, MyActivity’ string.

Easy enough, but there are 2 problems. First: how does our native shared library get packaged and called? And second: the Java SDK project directory hierarchy doesn’t really fit well into the source tree of a C/C++ project. There should be a directory per sample app with a couple of C++ files and a CMakeLists.txt file and nothing more.

The first problem is simple to solve: the project directory hierarchy contains a libs directory, and all .so files in there will be copied into the APK by ant (to verify this: an .apk is actually a zip file, simply change the file extension to zip and peek into the file). One important point: the libs directory contains one sub-directory level per CPU architecture, so once we start to support multiple CPU instruction sets we need to put them into subdirectories like this:

FlohOfWoe:libs floh$ ls
armeabi     armeabi-v7a mips        x86

Since my cmake build-system currently only supports building for armeabi-v7a I’ve put my .so file in the armeabi-v7a subdirectory.

Now I thought that I had everything in place: I had an APK file with my native code .so lib in it, I used the NativeActivity and the android_native_app_glue.h approach, and logged a “Hello World” to the system log (which can be inspected with adb logcat from the host system).

And still the app didn’t start; instead this showed up in the log:

D/AndroidRuntime(  482): Shutting down VM
W/dalvikvm(  482): threadid=1: thread exiting with uncaught exception (group=0x41597ba8)
E/AndroidRuntime(  482): FATAL EXCEPTION: main
E/AndroidRuntime(  482): Process: com.oryol.CoreHello, PID: 482
E/AndroidRuntime(  482): java.lang.RuntimeException: Unable to start activity ComponentInfo{com.oryol.CoreHello/android.app.NativeActivity}: java.lang.IllegalArgumentException: Unable to load native library: /data/app-lib/com.oryol.CoreHello-1/libCoreHello.so
E/AndroidRuntime(  482):    at android.app.ActivityThread.performLaunchActivity(ActivityThread.java:2195)

This was the second time I banged my head against the wall for a while, until I started to look into how linker dependencies are resolved for the shared library. I was pretty sure that I gave all the required libs on the linker command line (-lc -llog -landroid, etc); my error was that I assumed these are linked statically. Instead, linking against system libraries is dynamic by default. The ndk-depends tool helps in finding the dependencies:

localhost:armeabi-v7a floh$ ~/android-ndk-r9d/ndk-depends libCoreHello.so 
libCoreHello.so
libm.so
liblog.so
libdl.so
libc.so
libandroid.so
libGLESv2.so
libEGL.so

This is basically the list of .so files which must be contained in the APK, so I copied them into the SDK project’s libs directory, together with my libCoreHello.so. Update: These shared libs are not supposed to be packaged into the APK! Instead the standard system shared libraries which already exist on the device should be linked at startup.

I finally saw the sweet, sweet ‘Hello World!’ showing up in the adb log!

But I skipped one important part: so far I had fixed everything manually, but of course I want automated Android batch builds, and I don’t want those ugly Android skeleton project files in the git repository.

To solve this I did a bit of cmake-fu:

Instead of having the Android SDK project files committed into version control, I’m treating these as temporary build files.

When cmake runs for an Android build target, it does the following additional steps:

For each application target, a temporary Android SDK project is created in the build directory (basically the ‘android create project’ call described above):

# call the android SDK tool to create a new project
execute_process(COMMAND ${ANDROID_SDK_TOOL} create project
                --path ${CMAKE_CURRENT_BINARY_DIR}/android
                --target ${ANDROID_PLATFORM}
                --name ${target}
                --package com.oryol.${target}
                --activity DummyActivity
                WORKING_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR})

The output directory for the shared library linker step is redirected to the ‘libs’ subdirectory of this skeleton project:

# set the output directory for the .so files to point to the android project's 'libs/[cpuarch]' directory
set(ANDROID_SO_OUTDIR ${CMAKE_CURRENT_BINARY_DIR}/android/libs/${ANDROID_NDK_CPU})
set_target_properties(${target} PROPERTIES LIBRARY_OUTPUT_DIRECTORY ${ANDROID_SO_OUTDIR})
set_target_properties(${target} PROPERTIES LIBRARY_OUTPUT_DIRECTORY_RELEASE ${ANDROID_SO_OUTDIR})
set_target_properties(${target} PROPERTIES LIBRARY_OUTPUT_DIRECTORY_DEBUG ${ANDROID_SO_OUTDIR})

The required system shared libraries are also copied there: (DON’T DO THIS, normally the system’s standard shared libraries should be used)

# copy shared libraries over from the Android toolchain directory
# FIXME: this should be automated as post-build-step by invoking the ndk-depends command
# to find out the .so's, and copy them over
file(COPY ${ANDROID_SYSROOT_LIB}/libm.so DESTINATION ${ANDROID_SO_OUTDIR})
file(COPY ${ANDROID_SYSROOT_LIB}/liblog.so DESTINATION ${ANDROID_SO_OUTDIR})
file(COPY ${ANDROID_SYSROOT_LIB}/libdl.so DESTINATION ${ANDROID_SO_OUTDIR})
file(COPY ${ANDROID_SYSROOT_LIB}/libc.so DESTINATION ${ANDROID_SO_OUTDIR})
file(COPY ${ANDROID_SYSROOT_LIB}/libandroid.so DESTINATION ${ANDROID_SO_OUTDIR})
file(COPY ${ANDROID_SYSROOT_LIB}/libGLESv2.so DESTINATION ${ANDROID_SO_OUTDIR})
file(COPY ${ANDROID_SYSROOT_LIB}/libEGL.so DESTINATION ${ANDROID_SO_OUTDIR})

The default AndroidManifest.xml file is overwritten with a customized one:

# override AndroidManifest.xml 
file(WRITE ${CMAKE_CURRENT_BINARY_DIR}/android/AndroidManifest.xml
    "<manifest xmlns:android=\"http://schemas.android.com/apk/res/android\"\n"
    "  package=\"com.oryol.${target}\"\n"
    "  android:versionCode=\"1\"\n"
    "  android:versionName=\"1.0\">\n"
    "  <uses-sdk android:minSdkVersion=\"11\" android:targetSdkVersion=\"19\"/>\n"
    "  <uses-feature android:glEsVersion=\"0x00020000\"></uses-feature>"
    "  <application android:label=\"${target}\" android:hasCode=\"false\">\n"
    "    <activity android:name=\"android.app.NativeActivity\"\n"
    "      android:label=\"${target}\"\n"
    "      android:configChanges=\"orientation|keyboardHidden\">\n"
    "      <meta-data android:name=\"android.app.lib_name\" android:value=\"${target}\"/>\n"
    "      <intent-filter>\n"
    "        <action android:name=\"android.intent.action.MAIN\"/>\n"
    "        <category android:name=\"android.intent.category.LAUNCHER\"/>\n"
    "      </intent-filter>\n"
    "    </activity>\n"
    "  </application>\n"
    "</manifest>\n")

And finally, a custom build-step to invoke the ant-build tool on the temporary skeleton project to create the final APK:

if ("${CMAKE_BUILD_TYPE}" STREQUAL "Debug")
    set(ANT_BUILD_TYPE "debug")
else()
    set(ANT_BUILD_TYPE "release")
endif()
add_custom_command(TARGET ${target} POST_BUILD COMMAND ${ANDROID_ANT} ${ANT_BUILD_TYPE} WORKING_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}/android)

With all this in place, I can now do a:

> ./oryol make CoreHello android-make-debug

…to compile and package a simple Hello World Android app!

What’s currently missing is a simple wrapper to deploy and run an app on the device:

> ./oryol deploy CoreHello
> ./oryol run CoreHello

These would be simple wrappers around the adb tool; later this should of course also work for iOS apps.

Right now the Android build system only works on OSX and only for the ARM V7A instruction set, and there’s no proper Android port of the actual code yet, just a single log message in the CoreHello sample.

Phew, that’s it! All this stuff is also available on github (https://github.com/floooh/oryol/tree/master/cmake).


2 Feb 2014

It's so quiet here...

…because I’m doing a lot of weekend coding at the moment. I basically caught the github bug over the holidays:

http://www.github.com/floooh

I’ve been playing around with C++11, python, Vagrant, puppet and chef recently:

C++11:

  • I like: move semantics, for (:), variadic template arguments, std::atomic, std::thread, std::chrono, possibly std::function and std::bind (haven’t played around with these yet)
  • (still) not a big fan of: auto, std containers, exceptions, rtti, shared_ptr, make_shared
  • thread_local vs __thread vs __declspec(thread) is still a mess across Clang/OSX, GCC and VisualStudio
  • the recent crazy-talk about integrating a 2D drawing API into the C++ standard gives me the shivers, what a terrible, terrible idea!

Python

  • best choice/replacement for command-line scripts and asset tools (all major 3D modelling/animation tools are python-scriptable)
  • performance of the standard python interpreter is disappointing, and making something complex like FBX SDK work in alternative Python compilers is difficult or impossible

Vagrant plus Puppet or Chef

  • Vagrant is extremely cool for having an isolated cross-compilation Linux VM for emscripten and PNaCl: instead of writing a readme with all the steps required to get a working build machine, you can simply check in a Vagrantfile into the version control repository, and other programmers simply do a ‘vagrant up’ and have a VM which ‘just works’
  • the slow performance of shared directories on VirtualBox requires some silly workarounds; supposedly this is better with VMWare Fusion, but I haven’t tried that yet
  • Puppet vs Chef are like Coke vs Pepsi for such simple “stand-alone” use-cases. Chef seems to be more difficult to get into, but I think in the end it is more rewarding when trying to “scale up”


20 Dec 2013

Asset loading in emscripten and PNaCl

Loading data from a file on disk doesn’t look like a big deal in a normal C application:

int main() {
    // open file for reading
    FILE* fh = fopen("filename", "rb");
    if (fh) {

        // read some bytes
        char buffer[128];
        fread(buffer, sizeof(buffer), 1, fh);

        // close the file
        fclose(fh);
        fh = 0;
    }
    return 0;   
}

When doing a real-world game this simple approach has a couple of problems:

  • blocking: The above code is blocking; when reading from a fast hard disk this is probably not even noticeable, but try loading from a DVD or Bluray disk, or some sort of network drive over a slow connection, and the game loop will stutter
  • hard-coded paths: The concept of a current directory is often not portable, you can’t depend on the current directory being set to where your executable is. It is better to establish an absolute root location and have all filename paths in the game relative to that (of course how to establish this root location is platform-dependent again, for instance get the absolute path to the executable, and go on from there)
  • can’t use different transfer protocols: the above code works fine for local filesystems, but not for loading data from a web- or ftp-server, and operations like creating a new file, or randomly seeking in a file, may not be available with other protocols.

It is a good idea to restrict the type of file operations that a game can use, e.g.:

  • do we really need write and create access? An offline game may need to write save-game files and options, while an online game probably doesn’t need access to the local file system at all.
  • do we really need random seek? Randomly seeking in a file can be either impossible (HTTP) or slow because some mechanical device must be moved around; it’s often better to read a file straight into memory and seek there, or to avoid such operations altogether.
  • do we really need to iterate directory content? again, this can be either expensive (mechanical storage device) or impossible (in plain HTTP for instance)
  • do we really need free-form file paths? Games usually need to access very few places in the file system (the asset directory which is usually read-only, and maybe some sort of per-user writable location for settings and save-games)
  • do we really need access to file attributes? Stuff like last modification time, ownership, readable/writable. Usually this is not needed.
  • do we really need the concept of a “current directory”? This can be tricky for portability, and some platforms don’t have the concept of a current working directory at all

That’s a lot of features we don’t need in a game and which are also often not provided by web-based runtime platforms like PNaCl and JS. It helps to look at the HTTP protocol for inspiration, since that is where we need to load our data from anyway in the web scenario:

  • file system paths become URLs
  • only one read operation GET, which usually provides an entire file (but can also load a part of a file)
  • no directory iteration
  • no “write access” unless specifically allowed by the server
  • state-less, no current directory or current read position
  • operations can take very long (seconds or even minutes)

For a game which wants to load its assets from the web, the IO system should be designed around those restrictions.

As an example, here’s an overview of the Nebula3 IO system:

  • all paths are URLs: Not much to say about this :)
  • a single root location: At application start, a root location is established, this is usually a file:// URL pointing to the app’s installation directory, but can be overridden to point (for instance) to an http:// URL. Loading all data from a web server instead of the local hard disk is done with a single line of code which sets a different root location.
  • Amiga assigns as path aliases: A filesystem path to a texture looks like this in N3: tex:walls/brickwall.dds, where the tex: is an “AmigaOS assign” which is replaced with an absolute path, incorporating the root directory (see the sketch after this list).
  • all paths are absolute: there is no concept of a “current directory” in Nebula3, instead all paths resolve to an absolute location at runtime by replacing assigns in the path.
  • pluggable “virtual filesystem” modules associated with the URL scheme: URLs starting with file:// are handled by a different file system module than http://, plus Nebula3 apps can plug in their own filesystem modules if they want
  • stream objects, stream readers and stream writers: this is interesting in the web context only because there’s a MemoryStream object which is used to store and transfer downloaded data in RAM
  • asynchronous IO is really simple: more on that later in this post :)
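
To illustrate the assign mechanism, here’s a minimal sketch of how such path resolution could work (illustrative code with made-up assign values, not the actual Nebula3 implementation):

#include <map>
#include <string>

// replace "assign:" prefixes until a real URL scheme remains, e.g. with
// assigns = { tex: "home:export/textures/", home: "file:///C:/oryol/" },
// "tex:walls/brickwall.dds" resolves to
// "file:///C:/oryol/export/textures/walls/brickwall.dds"
std::string ResolveAssigns(const std::map<std::string, std::string>& assigns,
                           std::string path) {
    size_t colonIndex;
    while ((colonIndex = path.find(':')) != std::string::npos) {
        auto iter = assigns.find(path.substr(0, colonIndex));
        if (iter == assigns.end()) {
            break; // not a registered assign, probably a URL scheme like file://
        }
        path = iter->second + path.substr(colonIndex + 1);
    }
    return path;
}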

Since Nebula3 is also used as a command-line-tools framework, the IO subsystem is a bit of a hybrid, which in hindsight was a design fault. There are still all these writing and file creation operations, blocking IO, directory walking etc… which make the API quite bloated. In a new engine I would probably strictly separate the two scenarios: use the engine as a game framework only, which only supports very simple asynchronous read operations, and write the tools with another framework (or even another language, like python).

Asynchronous IO in Nebula3

Let’s look at async IO in Nebula3 a bit closer since this is the most interesting feature for web-based platforms. This is based on the “non-blocking future” pattern (or whatever you wanna call it) and depends on a frame-driven instead of event- or callback-driven application architecture.

Here’s some pseudo code:

void StartLoading() {
    // To start loading data we need to create an 
    // IO request object and "send it off" to the
    // IoInterface singleton for asynchronous processing
    Ptr<IO::ReadStream> req = IO::ReadStream::Create();
    req->SetURI("tex:walls/brickwall.dds");
    IoInterface::Singleton()->Send(req);

    // The IoRequest is now "in flight" and will contain
    // a result at some point in the future. Because we need
    // to check for completion in some later frame we need to
    // store the smart pointer somewhere
    this->pendingRequest = req;

    // ok, we're done for this frame...
}

void HandlePendingRequest() {
    // this function must be called regularly (e.g. per
    // frame) to check whether the async loading operation
    // has finished
    if (this->pendingRequest.isvalid() &&
        this->pendingRequest->Handled()) {

        // ok, the request has been completed, if 
        // the file was loaded successfully we get
        // a MemoryStream object with its content
        if (this->pendingRequest->GetSuccess()) {

            // actually load the data from the memory
            // stream and throw the request object away,
            // since all file data is in memory, we can
            // actually use the normal open/seek/read/close
            // pattern on the stream object
            this->LoadFromStream(this->pendingRequest->GetStream());

            // delete the request object, 
            // remember, this is a smart pointer :)
            this->pendingRequest = 0;
        }
    }
}

There may be less verbose or more elegant versions of this code of course, but the basic idea is that you start loading a file in one frame, and then need to check in the following frames if loading has finished (or failed), and get the completely loaded data in a memory buffer which can be parsed with “traditional” read and seek functions (and which is very fast since everything happens in memory).

This implies that the engine needs to know what to do while some required data has not been loaded yet. For a graphics pipeline this is quite simple: either render nothing, or render some placeholder while the data is still loading.

But there are cases where the code cannot progress without important data being loaded, or where it would be very tricky or impossible to implement asynchronous IO (for instance when integrating complex 3rd party libraries like sqlite).

If we could simply block, this wouldn’t be a problem: the worst thing that would happen is that our game loop would stutter. But on web platforms we cannot simply block the main thread (it is easier on PNaCl, where it is recommended to move the game loop into a separate thread, which then can block waiting for the main thread to process asynchronous IO requests).

For Nebula3 I fixed this with an additional application object state called the “Preloading Phase”. The idea is that the app enters this state outside of the normal game loop (for instance while displaying a loading screen), and during this state, populates a simple in-memory filesystem (basically just a lookup-table with URLs as keys and MemoryStream objects as values) with the asynchronously loaded data. When all data has been loaded (or failed to load), the app will leave the preloading phase (and hide the loading screen) and synchronous loader code will transparently get the data from the in-memory file system instead of starting an actual asynchronous IO request. Since all this preloaded data resides in memory this means of course that only small data and few files should be preloaded, and the majority of data should be asynchronously streamed on demand during the game loop. It’s really only a workaround for the few cases where synchronous access is absolutely necessary.
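
Boiled down to a sketch, this in-memory filesystem is little more than the following (the names are assumed here, not the actual Nebula3 classes):

#include <map>
#include <string>
#include <vector>

// maps URLs to fully loaded file content; filled during the preloading
// phase, and queried later by the 'synchronous' loader code
class PreloadFileSystem {
public:
    void Put(const std::string& url, std::vector<char> data) {
        this->files[url] = std::move(data);
    }
    // returns a pointer to the preloaded content, or nullptr if the file
    // wasn't preloaded (the caller would then fall back to an actual
    // asynchronous IO request)
    const std::vector<char>* Get(const std::string& url) const {
        auto iter = this->files.find(url);
        return (iter != this->files.end()) ? &iter->second : nullptr;
    }
private:
    std::map<std::string, std::vector<char>> files;
};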

More details about this in one of my presentations: http://www.slideshare.net/andreweissflog3/gdce2013-cpp-ontheweb

emscripten and PNaCl details

Ok, almost done!

For the emscripten and PNaCl platforms I basically wrote a simple Nebula3 filesystem module which fires HTTP GET requests through the respective emscripten and PNaCl API calls, and copies the received data into MemoryStream objects; it’s only a few hundred lines of code each.

The main difference between the two platforms lies in the use of threading:

  • PNaCl works like “traditional” platforms: there are a number of IO threads (about 10, but that’s tweakable), each of which processes one IO request at a time, so that as many IO requests can be in flight as there are IO threads. Those threads also directly handle processing of the received data, like decompression.
  • In emscripten, the IO calls (sending a HTTP request, and the callback when the response has been received) are handled on the main thread, but the expensive processing (e.g. decompression) of the received data is handed over to a WebWorker pool (usually 4 WebWorker threads). There can still be multiple IO requests in flight because the IO system doesn’t “wait” for an IO request to finish before firing a new one (but it is still throttled to restrict the number of requests in flight, in case a lot of requests arrive in a very short time period).

The actual code implementation is straightforward so I’ll spare you the source code samples. The respective class in PNaCl is called pp::URLLoader, and emscripten offers a whole set of rather specialized C functions which all start with emscripten_async_wget. Both fire an HTTP request (emscripten does an XmlHttpRequest, and presumably PNaCl does one under the hood as well - this has some unfortunate cross-domain implications), and invoke callbacks on failure or when data has arrived. PNaCl needs a bit more coding work since data is received in chunks (and the receive callback can be called multiple times), while emscripten waits until all data is received before calling the received-callback once.
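
For illustration, here’s a stripped-down sketch of the emscripten side built around emscripten_async_wget_data() (the actual Nebula3 wrapper is more involved, and the names here are made up):

#include <emscripten/emscripten.h>
#include <cstdio>
#include <vector>

// called once when the complete file content has arrived
static void OnLoaded(void* arg, void* buffer, int size) {
    // copy the downloaded bytes into our 'memory stream'
    std::vector<char>* stream = (std::vector<char>*) arg;
    stream->assign((char*)buffer, (char*)buffer + size);
    std::printf("loaded %d bytes\n", size);
}

static void OnFailed(void* arg) {
    std::printf("download failed!\n");
}

void StartLoad(const char* url, std::vector<char>* stream) {
    // returns immediately, one of the callbacks fires later on the main thread
    emscripten_async_wget_data(url, stream, OnLoaded, OnFailed);
}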

emscripten has more options to integrate the data with the web page DOM (for instance it can automatically create DOM image objects from downloaded image files), and it also has a very advanced CRT IO emulation layer (so you actually can directly use fopen/fclose after the data has been downloaded or preloaded), but I haven’t looked into these advanced concepts very closely since Nebula3 already does a lot of this layering itself.

There’s a similar filesystem layer for NaCl called nacl-mounts, but similarly to emscripten I didn’t look into this very closely since the low-level URL loading functions were a better fit for N3.

That’s it for today, have a nice Christmas everyone :)


3 Nov 2013

Messing around with MESS (and JSMESS)

And now for something completely different:

Since I’m dabbling with emscripten I’ve had this idea in my head to write or port a KC85/3 emulator, so that I could play the games I wrote as a kid directly in the browser. The existing KC85 emulators I was aware of are not trivial to port, they either depend on x86 inline assembly, or are hardwired to a specific UI framework (if you read German, here’s an overview of what’s out there: http://www.kc85emu.de/Emulatoren/Emulatoren.htm )

About 2 weeks ago I started to look around more seriously for a little side project to spend my 3 weeks of vacation around Christmas on (I need to burn my remaining vacation days, in Germany employees are basically required by law to take all their vacation - tough shit ;) My original plan was to cobble together a minimal emulator just enough to run my old games: take an existing Z80 CPU emulator like the one from FUSE, hack some keyboard input and video output and go on from there.

Thankfully I then had a closer look at MESS. I always thought that MESS could only emulate the most popular Western game machines like the C64 or Atari 400, but it turns out that this beast can emulate pretty much any computer that ever existed (between 600 and 1700, depending on how you count), it even has support for the PDP-1 from the early 60’s! When searching through the list of emulated systems here (http://www.progettoemma.net/mess/sysset.php) I stumbled over the following entries:

  • HC900 / KC 85/2
  • KC 85/3
  • KC 85/4
  • KC Compact
  • Lerncomputer LC 80
  • KC 85/1
  • Z1013
  • Poly-Computer 880
  • BIC A5105

That’s the entire list of East-German “hobby computers”. But wait, there’s more:

  • Robotron PC-1715
  • A5120
  • A7150

These were GDR office computers. The 1715 was a CP/M compatible 8-bit PC, and the A7150 was a not-quite-compatible x86 IBM-PC clone. I’m actually not sure what the 5120 was, just that it was a big ugly box with built-in mono-chrome monitor.

Since all those systems are marked as “not working” in this list I wasn’t too enthusiastic yet, but I had to be sure. The latest MESS compiled out of the box on OSX, and it was easy to find the right ROM images on the net. So I started MESS with:

./mess64 kc85_3 -window

To my astonishment I watched a complete boot sequence into the operating system:

KC85/3 system shell

Excite!

I had also come across the JSMESS project before, which is a port of MESS to Javascript using emscripten. So my next step was to compile JSMESS and see whether the KC emulator works there as well. It booted, but didn’t accept any keyboard input :( After comparing the source code it dawned on me that JSMESS was far behind MESS, about 2 years to be exact. But this was a good excuse to dive a bit deeper into how MESS actually works, and the deeper I crawled the more impressed I became.

MESS had been derived from the well-known MAME arcade machine emulator project, with the goal to extend the emulation to “real computers”. Later MESS merged with MAME again, so that today both projects compile from the same code base.

A specific emulated machine is called a “system driver” and can be described by just a few lines of code listing what CPU to use, the RAM and ROM areas, what ROM image to load, and what memory-mapped IO registers exist. You’ll also have to provide several callback routines for handling reads and writes to IO addresses and to convert the system’s video memory into a standardized bitmap representation. For a very simple computer built from standard chips a working emulator can be plugged together in a couple of hours, but writing a complete and “cycle-perfect” emulator is of course still a tough challenge, especially if custom chips are used. The overall amount of research and implementation work that went into MESS is almost overwhelming. Pretty much every computer, every mass-produced chip that ever existed is emulated in there, often with all of their undocumented quirks!

Ok, back to the KC85/3: after analyzing the source code of the KC driver it quickly became clear that the keyboard input emulation was the toughest part, since this was where the original KC engineers were very “creative”. As far as I understood the several pages of email exchange which are included as a comment in the MESS KC driver, the KC keyboard used a very exotic TV remote control chip to send serial pulses to the main unit (the KC had an external keyboard connected with a “very thin” wire, so it was very likely a simple serial connection). The base unit which received the signal didn’t have a “decoder chip” however, but used its universal Z80-CTC (timer) and -PIO (in/out) chips to decode the signal. Emulating this behaviour seems to be very tricky since a lot of KC emulators have janky keyboard input (not registering key presses, or inserting random key codes when typing fast, etc…).

Since I didn’t get this to work reliably even after back-porting the latest keyboard input code from MESS (which somewhat works, but still has problems with random keys triggering), I decided to be a bit naughty and implement a shortcut (the “cycle-perfect” emulator purists will likely kill me for this heresy):

After the KC-ROM reads a keyboard scan-code through this tricky serial-pulse decoding described above, it converts the scan code to ASCII and writes it to memory location 0x1FD, and then sets bit 0 in memory location 0x1F8 to signal that a new key code is available. It also maintains a keyboard repeat counter in address 0x1FA. All of this can be gathered from the keyboard handling code in ROM (and is also explained in that very informative, very long comment in the source code). I’m basically “shortcutting” this with C code, writing the ASCII code directly to 0x1FD and also handling the key repeat directly in C. The tricky serial decoding stuff in ROM is never triggered this way. With this hack the keyboard input is fairly responsive (sometimes the first key is swallowed, don’t know yet what’s up with this).
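
In code, the shortcut boils down to a few lines (a sketch; the ram pointer stands in for access to the emulated KC85 memory, and the addresses are the ones from the ROM listing described above):

#include <stdint.h>

enum {
    KEY_CODE_ADDR   = 0x1FD,    // ASCII code of the last pressed key
    KEY_STATUS_ADDR = 0x1F8,    // bit 0 set: new key code available
    KEY_REPEAT_ADDR = 0x1FA     // keyboard repeat counter
};

// poke a key directly into the memory locations polled by the ROM,
// bypassing the serial-pulse decoding entirely
void inject_key(uint8_t* ram, uint8_t ascii) {
    ram[KEY_CODE_ADDR]    = ascii;
    ram[KEY_STATUS_ADDR] |= 0x01;
    ram[KEY_REPEAT_ADDR]  = 0;  // key repeat is handled in C as well
}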

Next I had to fix the RGB colors which were off both in MESS and JSMESS (bright yellow-green looked more like puke-yellowish, and all other “in-between colors” were off too), and I finally back-ported (and also optimized a bit) the video memory mapping code from MESS to JSMESS.

You can check all my changes here on GitHub: https://github.com/floooh/jsmess/tree/floh Right now a “reboot” is going on in the JSMESS project to bring it up to date with the latest MESS version. I’ll wait with any pull-requests until this is finished and I’ve refreshed my own fork as well. Also, I will not try to contribute my “dirty hacks” back to the main code base of course; the MESS guys are right to insist on perfect emulation instead of shortcut hacks like the keyboard hack described above. But my (rather egoistic) main goal is to get my own games running on my web page, so I think I can get away with such hacks in my own fork.

The next challenge is to get all of my games running in JSMESS. This is harder than I thought. Part of the problem is that there exist several memory dump files which are not original. I found dump files with the wrong entry address, and dumps where others have implemented cheats and trainers. So far I’ve got 3 out of 7 games working. Getting the remaining 4 games into working condition might take a while since I may have to do some hardcore assembly debugging to find out what’s wrong.

Thankfully MESS has a complete assembler-level debugger built-in:

MESS Debugger

Reconstructing the program flow of this 25-year-old game which I wrote in machine code (instead of using an assembler) is actually quite a lot of fun, much easier than trying to reconstruct a program which was written in a high-level language and compiled to machine code. Subroutines often start at “even” addresses and have a block of NOP instructions appended, in case I needed to add instructions when fixing bugs; strings are usually embedded right into the instruction sequence instead of a central “string pool”. Analyzing the program flow comes down to figuring out what a given subroutine does (drawing a sprite? handling keyboard input? updating the hiscore display?), and what variables are stored at specific memory addresses (for instance the current live counter, the current position, and so on).

What’s remarkable is how small the game code actually is, even though it is not very dense, with all those NOPs in between and a lot of redundant code segments (I didn’t specifically care about code size). Of the about 12 kByte of my (very simple) Pacman clone, only about 3.5 kByte are actual code. The entire game code fits on a single screen (marked in yellow in the debugger’s memory view).


Finally, here’s the current result of this work: a JSMESS KC85/3 and KC85/4 emulator, and 3 of my old games running directly in the browser. Don’t try this on an iPhone though (or generally Safari); Firefox or an up-to-date Chrome works very well:

http://www.flohofwoe.net/history.html


8 Oct 2013

Farewell DirectX

Today I ported the OpenGL rendering code in Nebula3's bleeding edge branch back to Windows.

This is remarkable in 2 ways:

  1. It's the first time since around 1997 that I ported a significant amount of code to Windows. Usually it was from Windows towards another platform.
  2. This is also the end of DirectX in our code base (well almost, we're still using the xnamath.h header, which is a standalone header and doesn't require linking against any DX DLL).

Why do I think that this is remarkable:

It is the end of an era! In 1997 I ported Urban Assault from DOS to Windows95 and Direct3D5. This was just around the time when Windows started its career as a gaming platform. D3D5 was the first D3D version which didn't completely suck, because it had the new DrawPrimitive API; before that, rendering commands had to be issued through an incredibly arcane "execute buffer" concept (theoretically a good idea if GPUs had been able to directly parse this buffer, but terrible to use in real-world code). The Urban Assault port to D3D was pretty inefficient since we ported from a software rasterizer (with perspective correction and all that cool shit), and if I remember correctly we issued every single triangle through a single DrawPrimitive call (although that wasn't such a big deal at the time). And the only graphics card which had somewhat acceptable D3D support was the RIVA128 from an underdog company called nVidia (this was before their breakthrough TNT2), while the top dog was the 3dfx Voodoo2, which had much better support for Glide than for D3D. But since UA was published by Microsoft we had to be D3D-exclusive of course.

Since 1998 Direct3D was our primary rendering API; I dabbled around with OpenGL from time to time, but nothing serious. We made the jump to D3D7, D3D8, and finally D3D9. Each new version sucked less and less, and D3D9 is still a really good API. We never made the jump to D3D10 because of Microsoft's exceptionally stupid decision not to back-port D3D10 from Vista to Windows XP, and since Nebula was never about high-end rendering features but about running on a broad range of old hardware, we could never justify adding D3D10 support, because we couldn't give up D3D9.

And as silly as it sounds, this boneheaded Microsoft decision from 7 years ago is one important reason why I'm ditching D3D today. World-wide, WindowsXP is the fastest growing Windows version. It's growing a lot faster than Windows8. Don't believe me? See the Unity hardware stats page for a scary reality check:

http://stats.unity3d.com/web/index.html

The Chinese Dragon has awoken, and it is running exclusively on XP. WindowsXP is also very popular in Eastern Europe and the Middle East. So if you want to support markets east and south of Middle Europe you're basically fucked if you don't support XP.

Another important reason is streamlining the code base. The currently "interesting platforms" (browser and mobile) are all running some variant of POSIX+OpenGL. In this new world the Windows APIs are the exotics, and Microsoft doesn't exactly help the cause by repeating their errors of the past (limiting Windows Store apps to D3D11). By using a single rendering code base (and especially shader code base!) across all platforms we're reducing our technical debt in the future.

I have a fallback plan of course, because there are a few risks:

  • What if OpenGL driver quality on Windows is as bad as everybody says?
  • What if we need to support native Windows Store apps (as opposed to a WebGL version running embedded in a browser)?

The fallback plan has 2 stages:

  1. Use ANGLE which layers OpenGL ES2 with some important extensions over D3D9 or D3D11, this is the preferred solution since we don't need to touch the render layer code and shader library.
  2. If ANGLE isn't good enough, write native D3D9 and D3D11 ports of the CoreGraphics2 subsystem, and ideally use some API-agnostic shader language wrapper. This wouldn't be as bad as it sounds: each wrapper would have around 7k lines of code, which is about 4.5% of Nebula3 in its minimal useful configuration (which is about 150k lines of code; depending on which other N3 modules are added this can go up to 500k lines of code).

OpenGL isn't perfect of course. It has some incredibly crufty corners; most of those have been fixed through extensions and newer GL versions over time, but realistically we can't use anything newer than OpenGL ES2 with very few extensions for the renderer's base feature set.

When I removed the DirectX library stubs from the Nebula3 CMake files this afternoon I really had to stop and think for a moment. Who knows, maybe in a future blog post in about 15 years I will write "this was around the time when Windows became irrelevant as a gaming platform"? ;)


7 Sep 2013

emscripten and PNaCl: App entry in PNaCl

This is the follow-up to last week's post about application entry in emscripten. If you haven't read that yet, I recommend doing so first before continuing.

2 main points to keep in mind about the (P)NaCl platform:

  1. Blocking the main thread will block the entire browser tab.
  2. NaCl has true threading support which can be used to workaround these blocking limitations.

Point (1) is the same as on the emscripten platform, and point (2) is the big difference to emscripten.

In a Nebula3/PNaCl application, the main function looks the same as on any other platform (I'm using emscripten's "simulate_infinite_loop" approach now):

#include "myapplication.h"

ImplementNebulaApplication();

void
NebulaMain(const Util::CommandLineArgs& args)
{
    MyApplication app;
    app.SetCommandLineArgs(args);
    app.StartMainLoop();
}

However under the hood, the startup process until the NebulaMain() function is entered is completely different from other platforms, since PNaCl doesn't have a main() function. Instead PNaCl has the concept of application Module and Instance objects. This is where the plugin-nature of a PNaCl app shines through. There is a single Module object created on a web page containing a PNaCl app, and for each <embed> element on the page, one Instance object. In reality though, most of the time there will be exactly one Module and one Instance object, so the distinction doesn't really matter.

PNaCl offers two different startup APIs for C and C++. The C++ API is easier to grasp IMHO, so I'll just concentrate on this (this dual C/C++ nature continues through the whole NaCl API: there's a pure C API, extended by a slightly higher-level C++ API).

Hooking up your code to NaCl basically means writing 2 subclasses, one deriving from pp::Module and one deriving from pp::Instance; the NaCl runtime will then call into these classes through virtual methods for initialisation and for notifying the application about events.

But first things first:

Everything starts at a global function called pp::CreateModule() which you must provide, and which must return a new object of your pp::Module subclass (called N3NaclModule in this case):

namespace pp
{
    Module* CreateModule()
    {
        return new N3NaclModule();
    };
}

Although this is the very first function that NaCl will call, you should be aware that initialisers in the global scope (static objects) will already be initialised and have had their constructors called at this point.

The main job of the derived Module class is to create Instance objects, but we can also put some one-time init code in there. There's a pair of functions to initialise and shut down GL rendering, called glInitializePPAPI() and glTerminatePPAPI(). The only rule is that no GL calls may happen outside the bracket formed by these two calls, so I guess we could also put them somewhere else, as long as it is guaranteed that they are not called multiple times.

But - the most important method in the derived Module class is the factory method for Instance objects called CreateInstance. In my case, I have created a subclass of pp::Instance called NACL::NACLBridge.

The entire N3NaclModule class looks like this:

class N3NaclModule : public pp::Module
{
public:
    virtual ~N3NaclModule()
    {
        glTerminatePPAPI();
    }
    virtual bool Init()
    {
        return glInitializePPAPI(get_browser_interface()) == 1;
    }
    virtual pp::Instance* CreateInstance(PP_Instance instance)
    {
        return new NACL::NACLBridge(instance);
    };
};

All the really interesting stuff from here on happens in the NACLBridge object.

These two source snippets live inside the ImplementNebulaApplication() macro which all in all looks like this:

...
#elif __NACL__
#define ImplementNebulaApplication() \
class N3NaclModule : public pp::Module \
{ \
public: \
    virtual ~N3NaclModule() \
    { \
        glTerminatePPAPI(); \
    } \
    virtual bool Init() \
    { \
        return glInitializePPAPI(get_browser_interface()) == 1; \
    } \
    virtual pp::Instance* CreateInstance(PP_Instance instance) \
    { \
        return new NACL::NACLBridge(instance); \
    }; \
}; \
namespace pp \
{ \
    Module* CreateModule() \
    { \
        return new N3NaclModule(); \
    }; \
}
#elif __MACOS__
...

Now on to the NACLBridge class. This is (I know I'm repeating myself) derived from the pp::Instance class, but is called "Bridge" for a reason: in the PNaCl port we're spawning a dedicated thread for the game loop and leaving the main thread (aka the Pepper thread) for event handling and rendering. Our derived pp::Instance subclass serves as a "bridge" between these 2 threads; that's why it's called NACLBridge.

The NaCl runtime will call into virtual methods of an pp::Instance object for handling events, the most important of these are Init(), DidChangeView(), HandleInputEvent(). For a complete overview and exhaustive documentation of those callback methods I recommend sifting directly through the SDK header: include/ppapi/cpp/instance.h

In the Init() method I'm only building a CommandLineArgs object from the provided raw arguments (these have been extracted from our <embed> element in the HTML page).

The actual initialisation work happens (in my case) in the first call to DidChangeView(), by calling a Setup() method in the NACLBridge object. I chose this place because this is where I'm getting the current display dimensions of the <embed> element, which are required for the renderer initialisation (although now thinking about it, I might also be able to extract these from the arguments provided in the Init() method, need to try this out some time).

The NACLBridge::Setup() method only does one thing: create a thread with the NebulaMain() function as entry point, and then return to the NaCl runtime. The code inside NebulaMain() works just as on any other platform, with the only difference that it is not running on the main thread, but in its own dedicated game thread.
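
As a sketch, Setup() is essentially this (the args getter name is assumed, but this is the gist of it):

#include <pthread.h>

// entry point for the dedicated game thread
static void* GameThreadFunc(void* arg) {
    NACL::NACLBridge* bridge = (NACL::NACLBridge*) arg;
    // runs the game loop and is allowed to block,
    // since this is not the Pepper main thread
    NebulaMain(bridge->GetCommandLineArgs());
    return 0;
}

void NACL::NACLBridge::Setup() {
    pthread_t gameThread;
    pthread_create(&gameThread, 0, GameThreadFunc, this);
    // ...and return immediately to the NaCl runtime; the Pepper
    // thread keeps handling events and rendering
}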

The big advantage of running the game loop in its own thread is that you "own the game loop", and you can perform blocking, for instance to wait for IO. The disadvantage is that you can't call any PPAPI (NaCl system functions) from the game thread, which is a blog-post-topic on its own.

So to recap: The ImplementNebulaApplication macro runs on the main thread, and creates one pp::Module and one pp::Instance object. The pp::Instance object creates the dedicated game thread, which calls into the NebulaMain() function, which from that moment on runs the game loop like on any other platform. With this approach we don't need to slice the game loop into frames like on the emscripten platform.

Now that you heroically worked your way through all of this I'll tell you a secret: NaCl also provides a simple alternative to this complicated mess called the ppapi_simple library, which essentially provides a classic main() function running in its own thread, and because blocking is allowed on this thread, also provides normal POSIX fopen()/fclose() style blocking IO functions (sound familiar?).

Check out the header file include/ppapi_simple/ps.h as starting point.

Unfortunately this ppapi_simple library didn't exist when I started dabbling with NaCl about 2 years ago; it certainly would have made life a lot easier. On the other hand, the work that had already gone into the NaCl port made the emscripten port easier, which wouldn't be the case had I used the ppapi_simple wrapper code.


1 Sep 2013

emscripten and PNaCl: App entry in emscripten

When quickly hacking a graphics demo on the PC or consoles, the main function usually looks like this:

int main() 
{
    if (Initialize()) 
    {
        while (!Finished()) 
        {
            Update();
            Render();
        }
        Cleanup();
    }
    return 0;
}

Trying this on one of the browser platforms like emscripten or PNaCl results in a freeze, and after a little while the browser will kill your tab :(

The problem is that the browser won't "let you own the game loop", and this is a general problem of event- or callback-driven platforms (iOS and Android have the same problem for instance). On such platforms the execution flow of the main thread is not controlled by your game code, instead there's some outer event loop which will call into your code from time to time. If you spend too much time in your allotted slice of the pie you will drag the entire system event loop down, and other important events (such as input events) can't be handled fast enough. The result is that the entire user interface will feel sluggish and unresponsive to the user (for instance, scrolling in your browser tab will stutter or even freeze for multiple seconds). And if you don't return for about 30 seconds, the browser will kill your app (Aw Snap!).

This is all bad user experience of course, we want the browser to remain responsive, and scrolling smooth all the time, also during initialisation and load time.

The core problem is that your code must always return within a few milliseconds back to the browser (e.g. 16 or 33, depending on whether you're aiming for 60 or 30fps), and this is the big riddle we need to solve for a game application running in a browser.

For a Flash or Javascript coder, or someone who's mainly writing event-driven UI applications, this will all be familiar: they are used to having all their code run inside event handlers and callbacks, but typical UI apps usually don't need to do anything continuous. Event-driven applications sleep most of the time, react to (mostly input-) events from the outside, and go to sleep again. But games need to do continuous rendering, and thus are frame-driven, not event-driven, and mixing these two programming models isn't a very good idea because it's hard to follow the code-flow. The usual way to implement games on event-driven platforms is to set up a timer which calls a per-frame callback function many times per second. I think hacks like this are why game programmers have a deep hatred for UI-centric platforms (and why I still like Windows despite its other shortcomings, because the recommended event handling model in Windows for games (PeekMessage -> TranslateMessage -> DispatchMessage) actually lets you "own the game loop" in a very simple and elegant way through message polling, as sketched below).
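
For reference, here's roughly what that polled Windows game loop looks like (a sketch, reusing the Update()/Render() functions from the snippet at the top of this post):

#include <windows.h>

// inside main(): drain all pending window messages without
// blocking, then run one frame of game code
MSG msg;
bool quit = false;
while (!quit) {
    while (PeekMessage(&msg, NULL, 0, 0, PM_REMOVE)) {
        if (WM_QUIT == msg.message) {
            quit = true;
        }
        TranslateMessage(&msg);
        DispatchMessage(&msg);
    }
    Update();
    Render();
}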

There are a few different approaches to either get a true continuous game loop, or at least to create the illusion of one on platforms where polling isn't possible, mainly depending on whether "true" pthreads-style multi-threading is supported or not.

In a Nebula3/emscripten application true multi-threading isn't available: the actual game loop and the rendering code run on the main thread. The reason is that emscripten's multithreading support is built on WebWorkers; pthreads emulation isn't possible in emscripten since WebWorkers can't share memory with the main thread, and furthermore WebWorkers can't call into WebGL. This puts a lot of restrictions on our "game loop problem", and it required refactoring Nebula3's application model: in all previous ports there was always a way to somehow run a continuous game loop, mostly by moving it into its own thread, but we don't have this option in emscripten (yet ... but hopefully one day, with more flexible WebWorkers).

Traditionally, a Nebula3 application used to go through a simple "Open -> Run -> Close -> Exit" sequence. An N3 main file looked like this for instance:

#include "myapplication.h"

ImplementNebulaApplication();

void
NebulaMain(const Util::CommandLineArgs& args)
{
    MyApplication app;
    app.SetCommandLineArgs(args);
    if (app.Open())
    {
        app.Run();
        app.Close();
    }
    app.Exit();
}

Instead of a main() function, there's a NebulaMain() wrapper function and a macro called ImplementNebulaApplication(). These hide the fact that not all platforms have a standard main() (for a Windows application, one would typically use WinMain() for instance).

The actual system main function is hidden inside the ImplementNebulaApplication() macro, for a PC-like platform the macro code looks like this:

int __cdecl main(int argc, const char** argv)
{
    Util::CommandLineArgs args(argc, argv);
    NebulaMain(args);   // NebulaMain() returns void, see above
    return 0;
}
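
For comparison, on Windows the same macro would expand to a WinMain() wrapper instead; a hedged sketch (the command line parsing is an assumption for illustration, not the actual Nebula3 code):

#include <windows.h>

int WINAPI WinMain(HINSTANCE hInstance, HINSTANCE hPrevInstance, LPSTR lpCmdLine, int nCmdShow)
{
    // hypothetical: build the args object from the raw command line string
    Util::CommandLineArgs args(GetCommandLineA());
    NebulaMain(args);
    return 0;
}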

Now back up to the NebulaMain() function's content: the Application::Open() method could take a while to execute (a couple of seconds, worst case), and Application::Run() contains the "infinite" game loop, which only returns when the application should quit.

Since this wasn't a very good fit for the emscripten platform (because of the "infinite" loop inside the Run() method), the first step was to make the app entry even more abstract to give the platform-specific code more wiggle room:

#include "myapplication.h"

ImplementNebulaApplication();

void
NebulaMain(const Util::CommandLineArgs& args)
{
    static MyApplication* app = new MyApplication();
    app->SetCommandLineArgs(args);
    app->StartMainLoop();
}

The most obvious change is that there's only a single StartMainLoop() method instead of the Open->Run->Close->Exit sequence. And on closer inspection some strange stuff is going on here: the application object is now created on the heap, the pointer to the object lives in the global scope, and the app object is never deleted. WTF?!?

To understand what's going on we need to dive a bit deeper into the emscripten system API.

The StartMainLoop function is actually only a one-liner on the emscripten platform:

emscripten_set_main_loop(OnPhasedFrame, 0, 0);

This sets the per-frame callback (called OnPhasedFrame) which the browser runtime will call regularly, and we'll have to do everything inside this callback function. The first 0-arg is the intended callback frequency per second (e.g. 60); 0 has a special meaning: in this case emscripten uses the modern requestAnimationFrame mechanism to call our per-frame function (instead of the old-school setInterval or setTimeout way). The second argument is called simulate_infinite_loop, and to understand what this does it is first necessary to understand what happens when it is not used:

The emscripten_set_main_loop() function will simply return, all the way up to main(), which will also return right after it has started! WTF indeed...

In a normal C/C++ program, returning from the main() function means that the program is shutting down of course: local-scope objects are destroyed before leaving main(), then global-scope objects are destroyed during static deinitialisation.

In emscripten's case, a program which has called emscripten_set_main_loop() continues to run after main() has returned. This is a bit of a strange design decision, but it makes for familiar-looking code (e.g. hello_world.cpp is the same as on any other platform). Objects in the global scope continue to exist in emscripten after main() returns, but objects in the local scope of main() are destroyed. Hence the strange way to create our application object, which prevents the app object from being destroyed after main() is left:

    static MyApplication* app = new MyApplication();

And now back to that simulate_infinite_loop argument: this is a new argument which was introduced after I started the Nebula3 emscripten port. Setting it to 1 causes emscripten_set_main_loop() to never return to the caller; instead a Javascript exception is thrown, which essentially means that execution bails out of the C/C++ code without unwinding the (C/C++) stack. This leaves local-scope objects of the main() function alive, and everything after the emscripten_set_main_loop() call is never executed. So with this fix we could just as well write:

void
NebulaMain(const Util::CommandLineArgs& args)
{
    MyApplication app;
    app.SetCommandLineArgs(args);
    app.StartMainLoop();
}

Which looks a lot more friendly indeed.
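
For reference, outside of Nebula3 the same pattern boils down to a few lines; a minimal self-contained sketch:

#include <emscripten/emscripten.h>

// per-frame callback, invoked by the browser via requestAnimationFrame
static void OnFrame() {
    // ...update and render one frame...
}

int main() {
    // ...one-time initialisation...
    // fps=0: use requestAnimationFrame;
    // simulate_infinite_loop=1: don't return from this call,
    // so local-scope objects of main() stay alive
    emscripten_set_main_loop(OnFrame, 0, 1);
    return 0;   // never reached
}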

So this basically covers emscripten's application startup process: we now have a per-frame function (called OnPhasedFrame) which will be called back at 60fps, and we need to cram everything the application has to do into these 1/60sec time slices. This is fine for the actual game loop after everything has been loaded and initialised, but can be a problem for stuff like loading a new level, which could take a couple of seconds. In a traditional game, the worst thing that could happen in this case is that the loading screen animation (if there is any) may stutter, but in a browser environment such pauses affect the entire browser tab (freezing, no scrolling, etc...), which makes a very bad first impression on the user.

So what to do? For Nebula3 I created a new Application base class called "PhasedApplication". Such a phased application goes through different lifetime phases (== states), such as:

Initial     -> app has just become alive
Preloading  -> currently preloading data
Opening     -> currently initializing
Running     -> currently running the game loop
Closing     -> currently shutting down
Quit        -> shutting down has finished

Each of these phases (or states) has an associated per-frame callback method (OnInitial, OnPreloading, OnOpening, etc...). The central per-frame callback simply calls into one of those methods based on the current phase/state. Each phase method invocation must return quickly (the browser's responsiveness depends on this), and may be called many times until the next phase is activated. So instead of doing a lot of stuff in a single frame, we do many small things across many frames.
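
A hedged sketch of what that central callback might look like (the singleton accessor, GetPhase(), and the OnRunning/OnClosing names are assumptions based on the description above, not the actual Nebula3 code):

// dispatch to the current phase's per-frame method:
static void OnPhasedFrame() {
    // hypothetical: some globally reachable application pointer
    PhasedApplication* app = PhasedApplication::Instance();
    switch (app->GetPhase()) {
        case PhasedApplication::Initial:    app->OnInitial();    break;
        case PhasedApplication::Preloading: app->OnPreloading(); break;
        case PhasedApplication::Opening:    app->OnOpening();    break;
        case PhasedApplication::Running:    app->OnRunning();    break;
        case PhasedApplication::Closing:    app->OnClosing();    break;
        case PhasedApplication::Quit:       break;  // nothing left to do
    }
}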

The best example to illustrate this is the OnOpening() method. Suppose we need to do a lot of initialisation work during the app's Opening phase: files need to be loaded, subsystems must be initialised, and so on. This may take a couple of seconds. But the rule is that we should ideally return within 1/60sec, and we also don't have an independent render thread which could hide the main-thread freeze behind a smooth loading animation. So we need to do just a little bit of initialisation work, possibly update the rendering of the loading screen, and return to the browser runtime. But since we haven't switched to the next state yet, OnOpening() will be called again, and we can do the next piece of initialisation work. Sounds awkward of course, and it is, but there's not a lot we can do about it.
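
A hedged sketch of such sliced initialisation (the step counter and the helper methods are made up for illustration):

void MyApplication::OnOpening() {
    // do one small piece of initialisation work per call...
    switch (this->openingStep++) {
        case 0:  this->SetupSubsystems(); break;
        case 1:  this->LoadConfigFiles(); break;
        case 2:  this->CreateRenderResources(); break;
        default: this->SetPhase(Running); break;   // done, start the game loop
    }
    // ...keep the loading screen animated, then return to the browser
    this->DrawLoadingScreen();
}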

A new Javascript concept called generators could help to clean up this mess; with generators it should be possible to chop a long sequence of actions into small slices while leaving the function context intact (essentially like a yield() function in a cooperative multithreading system) - catapulting Javascript into the illustrious company of Windows 1.x and Classic MacOS. But enough with the ranting ;)

A somewhat cleaner method for long initialisation work is to start asynchronous actions through a WebWorker job in the first call to OnOpening(), then during subsequent OnOpening() calls check whether all of those actions have finished, gather the results, and finally switch to the next state, which would be Running. In the worst case, initialisation code must literally be chopped into little slices running on the main thread.
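
A hedged sketch of this approach using emscripten's worker API (emscripten_create_worker() and emscripten_call_worker() are part of emscripten's API; the worker file name, the worker-side function name, and the SetPhase() method are made up for illustration):

#include <emscripten/emscripten.h>

static worker_handle loaderWorker = 0;
static bool loadDone = false;

// called on the main thread when the worker job has finished
static void OnLoadFinished(char* data, int size, void* arg) {
    // ...consume the result data...
    loadDone = true;
}

void MyApplication::OnOpening() {
    if (0 == loaderWorker) {
        // first call: kick off the asynchronous job
        loaderWorker = emscripten_create_worker("loader_worker.js");
        emscripten_call_worker(loaderWorker, "do_load", 0, 0, OnLoadFinished, 0);
    }
    else if (loadDone) {
        // a later call: the job has finished, move on
        emscripten_destroy_worker(loaderWorker);
        this->SetPhase(Running);
    }
    // otherwise just return and wait for the next OnOpening() call
}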

So that's it for this blog post. Originally I wanted to compare emscripten's and PNaCl's startup process, but this would be way too much text for a single post; so next up will very likely be a similar walk-through of the PNaCl application start, and after that the next big topic: how to handle asset loading.
