24 May 2014

Shader Compilation and IDEs

I recently played around with shader code generation and the GLSL reference compiler in Oryol.

The result is IMHO pretty neat:

Shader source files (*.shd) live in the IDE next to C++ files:
enter image description here

Shader files are written in normal GLSL syntax with custom annotations (those @ and $ tags):
enter image description here

When compiling the project, a custom build step will generate vertex- and fragment shaders for different GLSL versions and run them through the GLSL reference compiler. Any errors from the reference compiler are converted to a format which can be parsed by the IDE:

enter image description here

Error parsing also works in Visual Studio:
enter image description here

Unfortunately I couldn’t get error parsing to work in QtCreator on Linux. The error messages are recognised, but double-clicking them doesn’t work.

After the GLSL compiler pass, a C++ header/source file pair will be created which contains the GLSL shader code and some C++ glue to make the shader accessible from the engine side.

The edit-compile-test cycle is only one or two seconds, depending on the link time of the demo code. Also, since the shader generation runs as a normal build step, shader code will also be generated and validated in command line builds.

Here’s how it works:

When cmake runs to create the build files it will look for XML files in the source code directories. For each XML file, a custom build target will be created which invokes a python script. This ‘generator script’ will generate a C++ header/source pair during compilation.

This generic code generation has only been used so far for the Oryol Messaging system, but it is flexible enough to cover other code generation scenarios (like generating shader code).

Setting up the custom build target involves 3 steps:

The actual build target must be created, cmake has the add_custom_target macro for this:

  COMMENT "Generating sources...")

This statement takes a variable target with the name of the build target which will compile the generated C++ sources plus a xmlFiles list variable and it will generate a new build target called [target]_gen The variables PYTHON and ORYOL_ROOT_DIR are config variables pointing to the python executable and the Oryol root directory.

To get the right build order, a target dependency must be defined so that the generated target is always run before the build target which needs the generated C++ source code:

add_dependencies(${target} ${target}_gen)

Finally we need to resolve a chicken-egg situation. All C++ files must exist when cmake assembles the build files, but the generated C++ files will only be created during the first build. To fix this situation, empty placeholder files are created if the generated sources don’t exist yet:

foreach(xmlFile ${xmlFiles})
    string(REPLACE .xml .cc src ${xmlFile})
    string(REPLACE .xml .h hdr ${xmlFile})
    if (NOT EXISTS ${src})
        file(WRITE ${src} " ")
    if (NOT EXISTS ${hdr})
        file(WRITE ${hdr} " ")

These 3 steps take care of the build configuration via cmake.

On to the python generator script:

First, the generator script parses the XML ‘source file’ which caused its invokation. For the shader generator, the XML file is very simple:

<Generator type="ShaderLibrary" name="Shaders" >
    <AddDir path="shd"/>

The most important piece is the AddDir tag which tells the generator script where it finds the actual shader source files. More then one AddDir can be added if the shader sources are spread over different directories.

Generator scripts must also include a dirty-check and only actually overwrite the target C++ files when the source files (in this case: the XML file and all shader sources) are newer then the target sources to prevent unneeded compilation of dependent files.

Shader File Parsing

Shader files will be processed by a simple line-parser:

  1. comments and white-space will be removed
  2. find and process ‘@’ and ‘$’ keywords
  3. gather GLSL code lines and keep track of their source file and line numbers (this is important for mapping error messages back later)

A very minimal shader file looks like this:

@vs MyVertexShader
@uniform mat4 mvp ModelViewProj
@in vec4 position
@in vec2 texcoord0
@out vec2 uv
void main() {
  $position = mvp * position;
  uv = texcoord0;

@fs MyFragmentShader
@uniform sampler2D tex Texture
@in vec2 uv
void main() {
  $color = $texture2D(tex, uv);

@bundle Main
@program MyVertexShader MyFragmentShader

This defines one vertex shader (between the @vs and @end tags) and a matching fragment shader (between @fs and @end). The vertex shader defines a 4x4 matrix uniform with the GLSL variable name mvp and the ‘bind name’ ModelViewProj, and it expects position and texture coordinates from the vertex. The vertex shader transforms the vertex-position into the special variable $position and forwards the texture coordinate to the fragment shader.

The fragment shader defines a texture sampler uniform with the GLSL variable name tex and the bind name Texture. It takes the texture coordinates emitted by the vertex shader, samples the texture and writes the color into the special variable $color.

Finally a shader @bundle with the name ‘Main’ is defined, and one shader program created from the previously defined vertex- and fragment-shader is attached to the bundle. A shader bundle is an Oryol-specific concept and is simply a collection of one or more shader programs that are related to each other.

Another useful tag which isn’t used in this simple example are the @block and @use tag. A @block encapsulates a piece of code which can then later be included with a @use tag in other blocks or vertex-/fragment-shaders. This is basically the missing #include mechanism for GLSL files.

Here’s some @block sample code, first a Util block is defined with general utility functions, then a block VSLighting which would contain lighting functions for vertex shaders, and FSLighting with lighting functions for fragment shaders. Both VSLighting and FSLighting want to use functions from the Util block (via @use Util). Finally the vertex- and fragment-shaders would contain a @use VSLighting and @use FSLighting (not shown). The shader code generator would then resolve all block dependencies and include the required the code blocks in the generated shader source in the right order:

@block Util
// general utility functions
vec4 bla() {
  vec4 result;
  return result;

@block VSLighting
// lighting functions for the vertex shader
@use Util
vec4 vsBlub() {
  return bla();

@block FSLighting
// lighting functions for the fragment shader
@use Util
vec4 fsBlub() {
  return bla();

GLSL Code Generation and Validation

From the ‘tagged shader source’, the shader generator script will create actual vertex- and fragment-shader code for different GLSL versions and feed it to the reference compiler for validation.

For instance, the above simple vertex/fragment-shader source would produce the following GLSL 1.00 source code (for OpenGLES2 and WebGL):

uniform mat4 mvp;
attribute vec4 position;
attribute vec2 texcoord0;
varying vec2 uv;
void main() {
  gl_Position = mvp * position;
  uv = texcoord0;

The output for a more modern GLSL version would look slightly different:

#version 150
uniform mat4 mvp;
in vec4 position;
in vec2 texcoord0;
out vec2 uv;
void main() {
  gl_Position = mvp * position;
  uv = texcoord0;

The GLSL reference compiler is called once per GLSL version and vertex-/fragment-shader and the resulting output is captured into a string variable. The python code to start an exe and capture its output looks like this:

child = subprocess.Popen([exePath, glslPath], stdout=subprocess.PIPE)
out = ''
while True :
    out += child.stdout.read()
    if child.poll() != None :
return out

The output will then be parsed for error messages and error line numbers. Since these line-numbers are pointing into the generated source code they are not useful themselves but must be mapped back to the original source-file-path and line-numbers. This is why the line-parser had to store this information with each extracted source code line.

The mapped source-file-path, line-number and error message must then be formatted into the gcc/clang- or VStudio-error-message format, and if an error occurs, the python script will terminate with an error code so that the build is stopped:

if platform.system() == 'Windows' :
    print '{}({}): error: {}'.format(FilePath, LineNumber + 1, msg)
else :
    print '{}:{}: error: {}\n'.format(FilePath, LineNumber + 1, msg)
if terminate:

This formatting works for Xcode and VisualStudio. The error is displayed by the IDE and can be double-clicked to position the text cursor over the right source code location. It doesn’t work in Qt Creator yet unfortunately, and I haven’t tested Eclipse yet.

Another thing to keep in mind is that build jobs can run in parallel. At first I was writing the intermediate GLSL files for the reference compiler into files with simple filenames (like ‘vs.vert’ and ‘fs.frag’). This didn’t cause any problems when doing trivial tests, but once I had converted all Oryol samples to use the shader generator I was sometimes getting weird errors from the reference compiler which didn’t make any sense at first.

The problem was that build jobs were running at the same time and overwrote each others intermediate files. The solution was to use randomized filenames which cannot collide. As always, python has a module just for this case called ‘tempfiles’:

# this writes to a new temp vertex shader file 
f = tempfile.NamedTemporaryFile(suffix='.vert', delete=False)
writeFile(f, lines)

# call the validator

# delete the temp file when done

The C++ Side

Last but not least a quick look at the generated C++ source code. The C++ header defines a namespace with the name of the shader-library, and one class per shader-bundle. The very simple vertex/fragment-shader sample from above would generate a header like this:

#pragma once
/*  #version:1#
    machine generated, do not edit!
#include "Render/Setup/ProgramBundleSetup.h"
namespace Oryol {
namespace Shaders {
class Main {
  static const int32 ModelViewProj = 0;
  static const int32 Texture = 1;
  static Render::ProgramBundleSetup CreateSetup();

Note the ModelViewProj and Texture constant definitions. These are used to set the uniform values in the C++ render loop.

How this code is actually used for rendering is a topic of its own. For now let me just point to the Oryol sample source code:


What’s next

The existing shader tags are already quite useful but only the beginning. The real problem I want to solve is to manage slightly differing variations of the same shader. For instance there might exist a specific high-level material, which must be applied to static and skinned geometry (2 variations), can cast shadows (4 variations: static shadow caster, skinned shadow caster), should be available in a forward-renderer and deferred-renderer (== many more slightly different shader variations). Sometimes an ueber-shader approach is better, and sometimes actually separate shaders for each variation are better.

The guts of those material shaders are always built from the same small code fragments, just arranged and combined differently.

Hopefully a couple of new ‘@’ and ‘$’ tags will be enough, but how this will look like in detail I don’t know yet. One inspiration are web-template engines which build web pages from a set of templates and rules. Another inspiration are the existing connect-the-dots shader editors (even though I want to keep the focus on ‘shaders-as-source-code’, not ‘shader-as-data’, but some limited runtime-code-generation would still make sense).

And of course the right middle-ground between ‘modern GLSL’ and ‘legacy GLSL’ must be found. Unfortunately OpenGL ES2 / WebGL1.0 will have to be the foundation for quite some time.

And that’s all for today :)

Written with StackEdit.

20 Apr 2014

cmake and the Android NDK

TL;DR: how to build Android NDK applications with cmake instead of the custom NDK build system, this is useful for projects which already use cmake to create multiplatform/cross-compiling build files.

Update: Thanks to thp for pointing out a rather serious bug: packaging the standard shared libraries into the APK should NOT be necessary since these are pre-installed on the device. I noticed that I didn’t set a library search path to the toolchain lib dir in the linker step (-L…) which might explain the crash I had earlier, but unfortunately I can’t reproduce this crash anymore with the old behaviour (no library search path and no shared system libraries in the APK). I’ll keep an eye on that and update the blog post with my findings.

I’ve spent the last 2.5 days adding Android support to Oryol’s build system. This wasn’t exactly on my to-do list until I sorta “impulse-bought” a Nexus7 tablet last Thursday. It basically went like this “hey that looks quite neat for a non-iPad tablet => wow, scrolling feels smooth, very non-Android-like => holy shit it runs my Oryol WebGL samples at 60fps => hmm 179 Euros seems quite reasonable…” - I must say I’m impressed how far the Android “user experience” has come since I last dabbled with it. The UI finally feels completely smooth, and I didn’t have any of those Windows8-Metro-style WTF-moments yet.

Ok, so the logical next step would be to add support for Android to the Oryol build system (if you don’t know what Oryol is: it’s a new experimental C++11 multi-plat engine I started a couple months ago: https://github.com/floooh/oryol).

The Oryol build system is cmake-based, with a python script on top which simplifies managing the dozens of possible build-configs. A build-config is one specific combination of target-platform (osx, ios, win32, win64, …), build-tools (make, ninja, Visual Studio, Xcode, …) and compile-mode (Release, Debug) stored under a descriptive name (e.g. osx-xcode-debug, win32-vstudio-release, emscripten-make-debug, …).

The front-end python script called ‘oryol’ is used to juggle all the build-configs, invoke cmake with the right options, and perform command line builds.

One can for instance simply call:

> ./oryol update osx-xcode-debug

…to generate an Xcode project.

Or to perform a command line build with xcodebuild instead:

> ./oryol build osx-xcode-debug

Or to build Oryol for emscripten with make in Release mode (provided the emscripten SDK has been installed):

> ./oryol build emscripten-make-release

This also works on Windows (32- or 64-bit):

> oryol build win64-vstudio-debug
> oryol build win32-vstudio-debug

…or on Linux:

> ./oryol build linux-make-debug

Now, what I want to do with my shiny new Nexus7 is of course this:

> ./oryol build android-make-debug

This turned out to be harder then usual. But lets start at the beginning:

A cross-compiling scenario is normally well defined in the GCC/cmake world:

A toolchain wraps the target-platform’s compiler tools, system headers and libs under a standardized directory structure:

The compiler tools usually reside in a bin subdirectory, and are called gcc and g++, or in the LLVM world: clang and clang++, sometimes the tools also have a prefix: pnacl-clang and pnacl-clang++), or they have completely different names (like emcc in the emscripten SDK).

Headers and libs are often located in a usr directory (usr/include and usr/lib).

The toolchain headers contain at least the the C-Runtime headers, like stdlib.h, stdio.h and usually the C++ headers (vector, iostream, …) and often also the OpenGL headers and other platform-specific header files.

Finally the lib directory contains precompiled system libraries for the target platform (for instance libc.a, libc++.a, etc…).

With such a standard gcc-style toolchain, cross-compilation is very simple. Just make sure that the toolchain-compiler tools are called instead of the host platform’s tools, and that the toolchain headers and libs are used.

cmake standardizes this process with its so-called toolchain-files. A toolchain-file defines what compilers tools, headers and libraries should be used instead of the ‘default’ ones, and usually also overrides compile and linker flags.

The typical strategy when adding a new target platform to a cmake build system looks like this:

  • setup the target platform’s SDK
  • create a new toolchain file (obviously)
  • tell cmake where to find the compiler tools, header and libs
  • add the right compile and linker flags

Once the toolchain file has been created, call cmake with the toolchain file:

> cmake -G"Unix Makefiles" -DCMAKE_TOOLCHAIN_FILE=[path-to-toolchain-file] [path-to-project]

Then run make in verbose mode to check whether the right compiler is called, and with the right options:

> make VERBOSE=1

This approach works well for platforms like emscripten or Google Native Client. Some platforms require a bit of additional cmake-magic, a Portable Native Client executable for instance must be “finalized” after it has been linked. Additional build steps like these can be added easily in cmake with the add_custom_command macro.

Integrating Android as a new target platform isn’t so easy though:

  • the Android SDK itself only allows to create pure Java applications, for C/C++ apps, the separate Android NDK (Native Development Kit) is required
  • the NDK doesn’t produce complete Android applications, it needs the Android Java SDK for this
  • native Android code isn’t a typical executable, but lives in a shared library which is called from Java through JNI
  • the Android SDK and NDK both have their own build systems which hide a lot of complexity
  • …this complexity comes from the combination of different host platforms (OSX, Linux, Windows), target API levels (android-3 to android-19, roughly corresponding to Android versions), compiler versions (gcc4.6, gcc4.9, clang3.3, clang3.4), and finally CPU architectures and instruction sets (ARM, MIPS, X86, with several variations for ARM (armv5, armv7, with or without NEON, etc…)
  • C++ support is still bolted on, the C++ headers and libs are not in their standard locations
  • the NDK doesn’t follow the standard GCC toolchain directory structure at all

The custom build system coming with the NDK does a good job to hide all this complexity, for instance it can automatically build for all CPU architectures, but it stops after the native shared library has been compiled: it cannot create a complete Android APK. For this, the Android Java SDK tools must be called from the command line.

So back to how to make this work in cmake:

The plan looks simple enough:

  1. compile our C/C++ code into a shared library instead of an executable
  2. somehow get this into a Java APK package file…
  3. …deploy APK to Android device and run it

Step 1 starts rather innocent, create a toolchain file, look up the paths to the compiler tools, headers and libs in the NDK, then lookup the compiler and linker command line args by watching a verbose build. Then put all this stuff into the right cmake variables. At least this is how it usually works. Of course for Android it’s all a bit more complicated:

  • first we need to decide on a target CPU architecture and what compiler to use. I settled for ARM and gcc4.8, which leads us to […]/android-ndk-r9d/toolchains/arm-linux-androideabi-4.8/prebuilt
  • in there is a directory darwin-x86_64 so we need separate paths by host platform here
  • finally in there is a bin directory with the compiler tools, so GCC would be for instance at [..]/android-ndk-r9d/toolchains/arm-linux-androideabi-4.8/prebuilt/darwin-x86_64/bin/arm-linux-androideabi-gcc
  • there’s also an include, lib and share directory but the stuff in there definitely doesn’t look like system headers and libs… bummer.
  • the system headers and libs are under the platforms directory instead: [..]/android-ndk-r9d/platforms/android-19/arch-arm/usr/include, and [..]/android-ndk-r9d/platforms/android-19/arch-arm/usr/lib
  • so far so good… put this stuff into the toolchain file and it seems to compile fine – until the first C++ header must be included - WTF?
  • on closer inspection, the system include directory doesn’t contain any C++ headers, and there’s different C++ lib implementations to choose from under [..]/android-ndk-r9d/sources/cxx-stl

This was the point where was seriously thinking about calling it a day until I stumbled across the make-standalone-toolchain.sh in build/tools. This is a helper script which will build a standard GCC-style toolchain for one specific Android API-level and target CPU:

sh make-standalone-toolchain.sh –-platform=android-19 

This will extract the right tools, headers and libs, and also integrate C++ headers (by default gnustl, but can be selected with the –stl option). When the script is done, a new directory ‘android-toolchain’ has been created which follows the GCC toolchain standard, and is much easier to integrate with cmake:

The important directories are:
- [..]/android-toolchain/bin, this is where the compiler tools are located, these are still prefixed though (e.g. arm-linux-androideabi-gcc
- [..]/android-toolchain/sysroot/usr/include CRT headers, plus EGL, GLES2, etc…, but NOT the C++ headers
- [..]/android-toolchain/include the C++ headers are here, under ‘c++’
- [..]/android-toolchain/sysroot/usr/lib .a and .so system libs, libstc++.a/.so is also here, no idea why

After setting these paths in the toolchain file, and telling cmake to create shared-libs instead of exes when building for the Android platform I got the compiler and linker steps. Instead of a CoreHello executable, I got a libCoreHello.so. So far so good.

Next step was to figure out how to get this .so into a APK which can be uploaded to an Android device.

The NDK doesn’t help with this, so this is where we need the Java SDK tools, which uses yet another build system: ant. From looking at the SDK samples I figured out that it is usually enough to call ant debug or ant release within a sample directory to build an .apk file into a bin subdirectory. ant requires a build.xml file which defines the build tasks to perform. Furthermore, Android apps have an embedded AndroidManifest.xml file which describes how to run the application, and what privileges it requires. None of these exist in the NDK samples directories though…

After some more exploration it became clear: The SDK has a helper script called android which is used (among many other things) to setup a project directory structure with all required files for ant to create a working APK:

> android create project
    --path MyApp
    --target android-19
    --name MyApp
    --package com.oryol.MyApp
    --activity MyActivity

This will setup a directory ‘MyApp’ with a complete Android Java skeleton app. Run ‘ant debug’ in there and it will create a ‘MyApp-debug.apk’ in the ‘bin’ subdirectory which can be deployed to the Android device with ‘adb install MyApp-debug.apk’, which when executed displays a ‘Hello World, MyActivity’ string.

Easy enough, but there are 2 problems, first: how to get our native shared library packaged and called?, and second: the Java SDK project directory hierarchy doesn’t really fit well into the source tree of a C/C++ project. There should be a directory per sample app with a couple of C++ files and a CMakeLists.txt file and nothing more.

The first problem is simple to solve: the project directory hierarchy contains a libs directory, all .so files in there will be copied into the APK by ant (to verify this: a .apk is actually a zip file, simply changed the file extension to zip and peek into the file). One important point: the lib directory contains one sub-directory-level for the CPU architecture, so once we start to support multiple CPU instruction sets we need to put them into subdirectories like this:

FlohOfWoe:libs floh$ ls
armeabi     armeabi-v7a mips        x86

Since my cmake build-system currently only supports building for armeabi-v7a I’ve put my .so file in the armeabi-v7a subdirectory.

Now I thought that I had everything in place, I got an APK file with my native code .so lib in it, I used the NativeActivity and the android_native_app_glue.h approach, and logged out a “Hello World” to the system log (which can be inspected with adb logcat from the host system).

And still the App didn’t start, instead this showed up in the log:

D/AndroidRuntime(  482): Shutting down VM
W/dalvikvm(  482): threadid=1: thread exiting with uncaught exception (group=0x41597ba8)
E/AndroidRuntime(  482): FATAL EXCEPTION: main
E/AndroidRuntime(  482): Process: com.oryol.CoreHello, PID: 482
E/AndroidRuntime(  482): java.lang.RuntimeException: Unable to start activity ComponentInfo{com.oryol.CoreHello/android.app.NativeActivity}: java.lang.IllegalArgumentException: Unable to load native library: /data/app-lib/com.oryol.CoreHello-1/libCoreHello.so
E/AndroidRuntime(  482):    at android.app.ActivityThread.performLaunchActivity(ActivityThread.java:2195)

This was the second time where I banged my head against the wall for a while until I started to look into how linker dependencies are resolved for the shared library. I was pretty sure that I gave all the required libs on the linker command line (-lc -llog -landroid, etc), the error was that I assumed that these are linked statically. Instead default linking against system libraries is dynamic. The ndk-depends helps in finding the dependencies:

localhost:armeabi-v7a floh$ ~/android-ndk-r9d/ndk-depends libCoreHello.so 

This is basically the list of .so files which must be contained in the APK. After I copied these to the SDK project's lib directory, together with my libCoreHello.so. Update: These shared libs are not supposed to be packaged into the APK! Instead the standard system shared libraries which already exist on the device should be linked at startup.

I finally saw the sweet, sweet ‘Hello World!’ showing up in the adb log!

But I skipped one important part: so far I fixed everything manually, but of course I want automated Android batch builds, and without having those ugly Android skeleton project files in the git repository.

To solve this I did a bit of cmake-fu:

Instead of having the Android SDK project files committed into version control, I’m treating these as temporary build files.

When cmake runs for an Android build target, it does the following additional steps:

For each application target, a temporary Android SDK project is created in the build directory (basically the ‘android create project’ call described above):

# call the android SDK tool to create a new project
execute_process(COMMAND ${ANDROID_SDK_TOOL} create project
                --path ${CMAKE_CURRENT_BINARY_DIR}/android
                --target ${ANDROID_PLATFORM}
                --name ${target}
                --package com.oryol.${target}
                --activity DummyActivity

The output directory for the shared library linker step is redirected to the ‘libs’ subdirectory of this skeleton project:

# set the output directory for the .so files to point to the android project's 'lib/[cpuarch] directory

The required system shared libraries are also copied there: (DON’T DO THIS, normally the system’s standard shared libraries should be used)

# copy shared libraries over from the Android toolchain directory
# FIXME: this should be automated as post-build-step by invoking the ndk-depends command
# to find out the .so's, and copy them over

The default AndroidManifest.xml file is overwritten with a customized one:

# override AndroidManifest.xml 
file(WRITE ${CMAKE_CURRENT_BINARY_DIR}/android/AndroidManifest.xml
    "<manifest xmlns:android=\"http://schemas.android.com/apk/res/android\"\n"
    "  package=\"com.oryol.${target}\"\n"
    "  android:versionCode=\"1\"\n"
    "  android:versionName=\"1.0\">\n"
    "  <uses-sdk android:minSdkVersion=\"11\" android:targetSdkVersion=\"19\"/>\n"
    "  <uses-feature android:glEsVersion=\"0x00020000\"></uses-feature>"
    "  <application android:label=\"${target}\" android:hasCode=\"false\">\n"
    "    <activity android:name=\"android.app.NativeActivity\"\n"
    "      android:label=\"${target}\"\n"
    "      android:configChanges=\"orientation|keyboardHidden\">\n"
    "      <meta-data android:name=\"android.app.lib_name\" android:value=\"${target}\"/>\n"
    "      <intent-filter>\n"
    "        <action android:name=\"android.intent.action.MAIN\"/>\n"
    "        <category android:name=\"android.intent.category.LAUNCHER\"/>\n"
    "      </intent-filter>\n"
    "    </activity>\n"
    "  </application>\n"

And finally, a custom build-step to invoke the ant-build tool on the temporary skeleton project to create the final APK:

    set(ANT_BUILD_TYPE "debug")
    set(ANT_BUILD_TYPE "release")

With all this in place, I can now do a:

> ./oryol make CoreHello android-make-debug

To compile and package a simple Hello World Android app!

What’s currently missing is a simple wrapper to deploy and run an app on the device:

> ./oryol deploy CoreHello
> ./oryol run CoreHello

These would be simple wrappers around the adb tool, later this should of course also work for iOS apps.

Right now the Android build system only works on OSX and only for the ARM V7A instruction set, and there’s no proper Android port of the actual code yet, just a single log message in the CoreHello sample.

Phew, that’s it! All this stuff is also available on github (https://github.com/floooh/oryol/tree/master/cmake).

Written with StackEdit.

2 Feb 2014

It's so quiet here...

…because I’m doing a lot of weekend coding at the moment. I basically caught the github bug over the holidays:


I’ve been playing around with C++11, python, Vagrant, puppet and chef recently:


  • I like: move semantics, for (:), variadic template arguments, std::atomic, std::thread, std::chrono, possibly std::function and std::bind (haven’t played around with these yet)
  • (still) not a big fan of: auto, std containers, exceptions, rtti, shared_ptr, make_shared
  • thread_local vs __thread vs __declspec(thread) is still a mess across Clang/OSX, GCC and VisualStudio
  • the recent crazy-talk about integrating a 2D drawing API into the C++ standard gives me the shivers, what a terrible, terrible idea!


  • best choice/replacement for command-line scripts and asset tools (all major 3D modelling/animation tools are python-scriptable)
  • performance of the standard python interpreter is disappointing, and making something complex like FBX SDK work in alternative Python compilers is difficult or impossible

Vagrant plus Puppet or Chef

  • Vagrant is extremely cool for having an isolated cross-compilation Linux VM for emscripten and PNaCl, instead of writing a readme with all the steps required to get a working build machine, you can simply check-in a Vagrantfile into the versioning system repository, and other programmers simply do a ‘vagrant up’ and have a VM which ‘just works’
  • the slow performance of shared directories on VirtualBox requires some silly workarounds, supposedly this is better with VMWare Fusion, but haven’t tried yet
  • Puppet vs Chef are like Coke vs Pepsi for such simple “stand-alone” use-cases. Chef seems to be more difficult to get into, but I think in the end it is more rewarding when trying to “scale up”

Written with StackEdit.

20 Dec 2013

Asset loading in emscripten and PNaCl

Loading data from a file on disk doesn’t look like a big deal in a normal C application:

int main() {
    // open file for reading
    FILE* fh = fopen("filename", "rb");
    if (fh) {

        // read some bytes
        char buffer[128];
        fread(buffer, sizeof(buffer), 1, fh);

        // close the file
        fh = 0;
    return 0;   

When doing a real-world game this simple approach has a couple of problems:

  • blocking: The above code is blocking, when reading from a fast hard disk this is probably not even noticeable, but try loading from a DVD or Bluray disk or some sort of network drive over a slow connection and the game loop will stutter
  • hard-coded paths: The concept of a current directory is often not portable, you can’t depend on the current directory being set to where your executable is. It is better to establish an absolute root location and have all filename paths in the game relative to that (of course how to establish this root location is platform-dependent again, for instance get the absolute path to the executable, and go on from there)
  • can’t use different transfer protocols: the above code works fine for local filesystems, but not loading data from a web- or ftp-server, and operations like creating a new file, or randomly seeking in a file may not be available with other protocols.

It is a good idea to restrict the type of file operations that a game can use, e.g.:

  • do we really need write and create access? An offline game may need to write save-game files and options, while an online game probably doesn’t need access to the local file system at all.
  • do we really need random seek? Randomly seeking in a file can be either impossible (HTTP) or slow because some mechanical device must be moved around, it’s often better to read a file straight into memory and seek there or to avoid such operations at all.
  • do we really need to iterate directory content? again, this can be either expensive (mechanical storage device) or impossible (in plain HTTP for instance)
  • do we really need free-form file paths? Games usually need to access very few places in the file system (the asset directory which is usually read-only, and maybe some sort of per-user writable location for settings and save-games)
  • do we really need access to file attributes? Stuff like last modification time, ownership, readable/writable. Usually this is not needed.
  • do we really need the concept of a “current directory”? This can be tricky for portability, and some platforms don’t have the concept of a current working directory at all

That’s a lot of features we don’t need in a game and which are also often not provided by web-based runtime platforms like PNaCl and JS. It helps to look at the HTTP protocol for inspiration, since that is where we need to load our data from anyway in the web scenario:

  • file system paths become URLs
  • only one read operation GET, which usually provides an entire file (but can also load a part of a file)
  • no directory iteration
  • no “write access” unless specifically allowed by the server
  • state-less, no current directory or current read position
  • operations can take very long (seconds or even minutes)

For a game which wants to load its asset from the web the IO system should be designed around those restrictions.

As an example, here’s an overview of the Nebula3 IO system:

  • all paths are URLs: Not much to say about this :)
  • a single root location: At application start, a root location is established, this is usually a file:// URL pointing to the app’s installation directory, but can be overriden to point (for instance) to an http:// URL. Loading all data from a web server instead of the local hard disk is done with a single line of code which sets a different root location.
  • Amiga assigns as path aliases: A filesystem path to a texture looks like this in N3: tex:walls/brickwall.dds, where the tex: is an “AmigaOS assign” which is replaced with an absolute path, incorporating the root directory.
  • all paths are absolute: there is no concept of a “current directory” in Nebula3, instead all paths resolve to an absolute location at runtime by replacing assigns in the path.
  • pluggable “virtual filesystem” modules associated with the URL scheme: URLs starting with file:// are handled by a different file system module than http://, plus Nebula3 apps can plug in their own filesystem modules if they want
  • stream objects, stream readers and stream writers: this is interesting in the web context only because there’s a MemoryStream object which is used to store and transfer downloaded data in RAM
  • asynchronous IO is really simple: more on that later in this post :)

Since Nebula3 is also used as a command-line-tools framework, the IO subsystem is a bit of a hybrid, which in hindsight was a design fault. There are still all these writing and file creation operations, blocking IO, directory walking etc… which makes the API quite bloated. In a new engine I would probably strictly separate the two scenarios, use the engine as a game framework only, which only supports very simple asynchronous read operations, and write the tools with another framework (or even other language, like python).

Asynchronous IO in Nebula3

Let’s look at async IO in Nebula3 a bit closer since this is the most interesting feature for web-based platforms. This is based on the “non-blocking future” pattern (or whatever you wanna call it) and depends on a frame-driven instead of event- or callback-driven application architecture.

Here’s some pseudo code:

void StartLoading() {
    // To start loading data we need to create an 
    // IO request object and "send it off" to the
    // IoInterface singleton for asynchronous processing
    Ptr<IO::ReadStream> req = IO::ReadStream::Create();

    // The IoRequest is now "in flight" and will contain
    // a result at some point in the future. Because we need
    // to check for completion in some later frame we need to
    // store the smart pointer somewhere
    this->pendingRequest = req;

    // ok, we're done for this frame...

void HandlePendingRequest() {
    // this function must be called regularly (e.g. per
    // frame) to check whether the async loading operation
    // has finished
    if (this->pendingRequest.isvalid() &&
        this->pendingRequest->Handled()) {

        // ok, the request has been completed, if 
        // the file was loaded successfully we get
        // a MemoryStream object with its content
        if (this->pendingRequest->GetSuccess()) {

            // actually load the data from the memory
            // stream and throw the request object away,
            // since all file data is in memory, we can
            // actually use the normal open/seek/read/close
            // pattern on the stream object

            // delete the request object, 
            // remember, this is a smart pointer :)
            this->pendingRequest = 0;

There may be less verbose or more elegant versions of this code of course, but the basic idea is that you start loading a file in one frame, and then need to check in the following frames if loading has finished (or failed), and get the completely loaded data in a memory buffer which can be parsed with “traditional” read and seek functions (and which is very fast since everything happens in memory).

This implies that the engine needs to know what to do while some required data has not been loaded yet. For a graphics pipeline this is quite simple by either rendering nothing or some placeholder while the data is still loading.

But there are cases where the code cannot progress without important data being loaded, or where it would be very tricky or impossible to implement asynchronous IO (for instance when integrating complex 3rd party libraries like sqlite).

If we could simply block this wouldn’t be a problem: the worst thing that would happen is that our game loop would stutter, but on web platforms we cannot simply block the main thread (it is easier on PNaCl where it is recommended to move the game loop into a separate thread, which then can block waiting for the main thread to process asynchronous IO requests).

For Nebula3 I fixed this with an additional application object state called the “Preloading Phase”. The idea is that the app enters this state outside of the normal game loop (for instance while displaying a loading screen), and during this state, populates a simple in-memory filesystem (basically just a lookup-table with URLs as keys and MemoryStream objects as values) with the asynchronously loaded data. When all data has been loaded (or failed to load), the app will leave the preloading phase (and hide the loading screen) and synchronous loader code will transparently get the data from the in-memory file system instead of starting an actual asynchronous IO request. Since all this preloaded data resides in memory this means of course that only small data and few files should be preloaded, and the majority of data should be asynchronously streamed on demand during the game loop. It’s really only a workaround for the few cases where synchronous access is absolutely necessary.

More details about here in one of my presentations: http://www.slideshare.net/andreweissflog3/gdce2013-cpp-ontheweb

emscripten and PNaCl details

Ok, almost done!

For the emscripten and PNaCl platforms I basically wrote a simple Nebula3 filesystem module which fires HTTP GET requests through he respective emscripten and PNaCl API calls, and copies the received data into MemoryStream objects, it’s only a few hundred lines of code each.

The main difference between the two platforms lies in the use of threading:

  • PNaCl works like “traditional” platforms, there are a number of IO threads (about 10, but that’s tweakable) each of them processes one IO request at a time, so that as many IO requests can be in flight as there are IO threads. Those threads also directly handle processing of the received data like decompression.
  • In emscripten, the IO calls (sending a HTTP request, and the callback when the response has been received) is handled on the main thread, but the expensive processing (e.g. decompression) of the received data is handed over to a WebWorker pool (usually 4 WebWorker threads). There can still be multiple IO requests in flight because the IO system doesn’t “wait” for an IO request to finish before firing a new one (but it is still throttled to restrict the number of requests in flight in case a lot of requests arrive in a very short time period).

The actual code implementation is straightforward so I’ll spare you the source code samples. The respective class in PNaCl is called pp::URLLoader, and emscripten offers a whole set of rather specialized C functions which all start with emscripten_async_wget. Both fire an HTTP request (emscripten does an XmlHttpRequest, and PNaCl presumably under the hood as well - this has some unfortunate cross-domain implications), and invoke callbacks on failure or when data has arrived. PNaCl needs a bit more coding work since data is received in chunks (and the receive callback can be called multiple times), while emscripten waits until all data is received before calling the received-callback once.

emscripten has more options to integrate the data with the web page DOM (for instance it can automatically create DOM image objects from downloaded image files), and it also has a very advanced CRT IO emulation layer (so you actually can directly use fopen/fclose after the data has been downloaded or preloaded), but I haven’t looked into these advanced concepts very closely since Nebula3 already does a lot of this layering itself.

There’s a similar filesystem layer for NaCl called nacl-mounts, but similarly to emscripten I didn’t look into this very closely since the low-level URL loading functions were a better fit for N3.

That’s it for today, have a nice Christmas everyone :)

Written with StackEdit.

3 Nov 2013

Messing around with MESS (and JSMESS)

And now for something completely different:

Since I’m dabbling with emscripten I’ve had this idea in my head to write or port a KC85/3 emulator, so that I could play the games I wrote as a kid directly in the browser. The existing KC85 emulators I was aware of are not trival to port, they either depend on x86 inline assembly, or are hardwired to a specific UI framework (if you read German, here’s an overview on what’s out there: http://www.kc85emu.de/Emulatoren/Emulatoren.htm )

About 2 weeks ago I started to look around more seriously for a little side project to spent my 3 weeks of vacation around Christmas (I need to burn my remaining vacation days, in Germany employees are basically required by law to take all their vacation - tough shit ;) My original plan was to cobble together a minimal emulator just enough to run my old games: Take an existing Z80 CPU emulator like the one from FUSE, hack some keyboard input and video output and go on from there.

Thankfully I then had a closer look at MESS. I always thought that MESS could only emulate the most popular Western game machines like the C64 or Atari 400, but it turns out that this beast can emulate pretty much any computer that ever existed (between 600 and 1700, depending on how you count), it even has support for the PDP-1 from the early 60’s! When searching through the list of emulated systems here (http://www.progettoemma.net/mess/sysset.php) I stumbled over the following entries:

  • HC900 / KC 85/2
  • KC 85/3
  • KC 85/4
  • KC Compact
  • Lerncomputer LC 80
  • KC 85/1
  • Z1013
  • Poly-Computer 880
  • BIC A5105

That’s the entire list of East-German “hobby computers”. But wait, there’s more:

  • Robotron PC-1715
  • A5120
  • A7150

These were GDR office computers. The 1715 was a CP/M compatible 8-bit PC, and the A7150 was a not-quite-compatible x86 IBM-PC clone. I’m actually not sure what the 5120 was, just that it was a big ugly box with built-in mono-chrome monitor.

Since all those systems are marked as “not working” in this list I wasn’t too enthusiastic yet, but I had to be sure. The latest MESS compiled out of the box on OSX, and it was easy to find the right ROM images in the net. So I started MESS with:

./mess64 kc85_3 -window

To my astonishment I watched a complete boot sequence into the operating system:

KC85/3 system shell


I also came across the JSMESS project before, which is a port of MESS to Javascript using emscripten. So my next step was to compile JSMESS and see whether the KC emulator works there as well. It booted, but didn’t accept any keyboard input :( After comparing the source code it dawned on me that JSMESS was far behind MESS, about 2 years to be exact. But this was a good excuse to dive a bit deeper into how MESS actually works, and the deeper I crawled the more impressed I became.

MESS had been derived from the well known MAME arcade machine emulator project, with the goal to extend the emulation to “real computers”. Later MESS merged with MAME again, so that today both projects compile from the same code base.

A specific emulated machine is called a “system driver” and can be described by just a few lines of code listing what CPU to use, the RAM and ROM areas, what ROM image to load, and what memory-mapped IO registers exist. You’ll also have to provide several callback routines for handling reads and write to IO addresses and to convert the system’s video memory into a standardized bitmap representation. For a very simple computer built from standard chips a working emulator can be plugged together in a couple of hours, but writing a complete and “cycle-perfect” emulator is of course still a tough challenge, especially if custom chips are used. The overall amount of research and implementation work that went into MESS is almost overwhelming. Pretty much every computer, every mass-produced chip that ever existed is emulated in there, often with all of their undocumented quirks!

Ok, back to the KC85/3: after analyzing the source code of the KC driver it quickly became clear that the keyboard input emulation was the toughest part, since this was where the original KC engineers were very “creative”. As far as I understood the several pages of email exchange which are included as comment in the MESS KC driver, the KC keyboard used a very exotic TV remote control chip to send serial pulses to the main unit (the KC had an external keyboard connected with a “very thin” wire, so it was very likely a simple serial connection). The base unit which received the signal didn’t have a “decoder chip” however, but used its universal Z80-CTC (timer) and -PIO (in/out) chips to decode the signal. Emulating this behaviour seems to be very tricky since a lot of KC emulators have yanky keyboard input (not registering key presses, or inserting random key codes when typing fast, etc…).

Since I didn’t get this to work reliably even after back-porting the latest keyboard input code from MESS (which somewhat works, but still has problems with random keys triggering), I decided to be a bit naughty and implement a shortcut (the “cycle-perfect” emulator purists will likely kill me for this heresy):

After the KC-ROM reads a keyboard scan-code through this tricky serial-pulse decoding described above, it converts the scan code to ASCII and writes it to memory location 0x1FD, and then sets bit 0 in memory location 0x1F8 to signal that a new key code is available. It also maintains a keyboard repeat counter in address 0x1FA. All of this can be gathered from the keyboard handling code in ROM (and is also explained in that very informative, very long comment in the source code). I’m basically “shortcutting” this with C code and write the ASCII code directly to 0x1FD and also handle the key repeat directly in C. The tricky serial decoding stuff in ROM is never triggered this way. With this hack the keyboard input is fairly responsive (sometimes the first key is swallowed, don’t know yet what’s up with this).

Next I had to fix the RGB colors which were off both in MESS and JSMESS (bright yellow-green looked more like puke-yellowish, and all other “inbetween colors” were off too), and I finally back-ported (and also optimized a bit) the video memory mapping code from MESS to JSMESS.

You can check all my changes here on GitHub: https://github.com/floooh/jsmess/tree/floh Right now a “reboot” is going on in the JSMESS project to bring it uptodate with the latest MESS version. I’ll wait with any pull-requests until this is finished and I refreshed my own fork as well. Also I will not try to contribute my “dirty hacks” back to the main code base of course, the MESS guys are right to insist on perfect emulation instead of shortcut hacks like the keyboard hack described above. But my (rather egoistic) main goal is to get my own games running on my web page, so I think I can get away with such hacks in my own fork.

The next challenge is to get all of my games running in JSMESS. This is harder than I thought. Part of the problem is that there exist several memory dump files which are not original. I found dump files with the wrong entry address, and dumps where others have implemented cheats and trainers. So far I’ve got 3 out of 7 games working. Getting the remaining 4 games into working condition might take a while since I may have to do some hardcore assembly debugging to find out what’s wrong.

Thankfully MESS has a completely assembler-level debugger built-in:

MESS Debugger

Re-constructing the program flow of this 25-year-old game which I wrote in machine code (instead of using a assembler) is actually quite a lot of fun, much easier than trying to reconstruct a program which was written in a high-level language and compiled to machine code. Subroutines often start at “even” addresses, and have a block of NOP instructions appended, in case I needed to add instructions when fixing bugs, strings are usually embedded right into the instruction sequence instead of a central “string pool”. Analyzing the program flow comes down to figuring out what a given subroutine does (drawing a sprite? handling keyboard input? updating the hiscore display?), and what variables are stored at specific memory addresses (for instance current live counter, current position, and so on).

What’s remarkable is how small the game code actually is, even though it is not very dense with all those NOPs inbetween and a lot of redundant code segments (e.g. I didn’t specifically care about code size). Of the about 12kByte of my (very simple) Pacman clone, only about 3.5 kByte are actual code. The entire game code fits on a single screen (marked in yellow here):

enter image description here

Finally, here’s the current result of this work: a JSMESS KC85/3 and KC85/4 emulator, and 3 of my old games running directly in the browser. Don’t try this on an iPhone though (or generally Safari). Firefox or an uptodate Chrome works very well:


Written with StackEdit.

8 Oct 2013

Farewell DirectX

Today I ported the OpenGL rendering code in Nebula3's bleeding edge branch back to Windows:

enter image description here

This is remarkable in 2 ways:

  1. It's the first time since around 1997 that I ported a significant amount of code to Windows. Usually it was from Windows towards another platform.
  2. This is also the end of DirectX in our code base (well almost, we're still using the xnamath.h header, which is a standalone header and doesn't require linking against any DX DLL).

Why do I think that this is remarkable:

It is the end of an era! In 1997 I ported Urban Assault from DOS to Windows95 and Direct3D5. This was just around the time when Windows started its career as a gaming platform. D3D5 was the first D3D version which didn't completely suck because it had the new DrawPrimitive API, before that, rendering commands had to be issued through an incredibly arcane "execute buffer" concept (theoretically a good idea if GPUs would have been able to directly parse this buffer, but terrible to use in real-world code). The Urban Assault port to D3D was pretty inefficient since we ported from a software rasterizer (with perspective correction and all that cool shit), and if I remember correctly we issued every single triangle through a single DrawPrimitive call (although that wasn't such a big deal at the time). And the only graphics card which had somewhat acceptable D3D support was the RIVA128 from an underdog company called nVidia (this was before their breakthrough TNT2), and the top dog was the 3dfx Voodoo2 which had much better support for Glide then for D3D. But since UA was published by Microsoft we had to be D3D-exclusive of course.

Since 1998 Direct3D was our primary rendering API, I dabbled around with OpenGL from time to time, but nothing serious. We made the jump to D3D7, D3D8, and finally D3D9. Each new version sucked less and less, and D3D9 is still a really good API. We never made the jump to D3D10 because of Microsoft's exceptionally stupid decision to not back-port D3D10 to Windows XP from Vista, and since Nebula was never about high-end rendering features but instead running on a broad range of old hardware we could never justify to add D3D10 support, since we couldn't give up D3D9.

And as silly as it sounds, this boneheaded Microsoft decision from 7 years ago is one important reason why I'm ditching D3D today. World-wide, WindowsXP is the fastest growing Windows version. It's growing a lot faster than Windows8. Don't believe me? See the Unity hardware stats page for a scary reality check:


The Chinese Dragon has awoken, and it is running exclusively on XP. WindowsXP is also very popular in Eastern Europe and the Middle East. So if you want to support markets east and south of Middle Europe you're basically fucked if you don't support XP.

Another important reason is streamlining the code base. The currently "interesting platforms" (browser and mobile) are all running some variant of POSIX+OpenGL. In this new world the Windows APIs are the exotics, and Microsoft doesn't exactly help the cause by repeating their errors of the past (limiting Windows Store apps to D3D11). By using a single rendering code base (and especially shader code base!) across all platforms we're reducing our technical debt in the future.

I have a fallback plan of course, because there are a few risks:

  • What if OpenGL driver quality on Windows is as bad as everybody says?
  • What if we need to support native Windows Store apps (as opposed to a WebGL version running embedded in a browser)?

The fallback plan has 2 stages:

  1. Use ANGLE which layers OpenGL ES2 with some important extensions over D3D9 or D3D11, this is the preferred solution since we don't need to touch the render layer code and shader library.
  2. If ANGLE isn't good enough, write native D3D9 and D3D11 ports of the CoreGraphics2 subsystem, and optimally use some API agnostic shader language wrapper. This wouldn't be as bad as it sounds, each wrapper would have around 7k lines of code, which is about 4.5% of Nebula3 in its minimal useful configuration (which is about 150k lines of code, depending on which other N3 modules are added this can go up to 500k lines of code).

OpenGL isn't perfect of course. It has some incredibly crufty corners, most of those have been fixed in through extensions and newer GL versions over time, but realistically we can't use anything newer then OpenGL ES2 with very few extensions for the renderer's base feature set.

When I removed the DirectX library stubs from the Nebula3 CMake files this afternoon I really had to stop and think for a moment. Who knows, maybe in a future blog post in about 15 years I will write "this was around the time when Windows became irrelevant as a gaming platform"? ;)

Written with StackEdit.