<h1>The Brain Dump</h1>
<p><em>Game development, Nebula Device, personal mumblings...</em></p>
<h2>Moving to github</h2>
<p><em>2016-01-09</em></p>
I'm moving this blog over to github in a desperate attempt to finally keep all my web stuff in a single place (plus: keeping the blog in a version-controlled repo is so much nicer than Blogger, finally no more looking for the perfect blog editor, better control over the layout etc etc etc...).<br />
<br />
The new URL is:<a href="http://floooh.github.io/" target="_blank"> http://floooh.github.io/</a><br />
<br />
See you on the other side :)<br />
-Floh.

<h2>New Adventures in 8-Bit Land</h2>
<p><em>2014-11-10</em></p>
<p>Alan Cox’s recent <a href="https://plus.google.com/111104121194250082892/posts/a2jAP7Pz1gj">announcement</a> of his Unix-like operating system for old home computers got me thinking: wouldn’t it be cool to write programs for the KC85/3 in C, a language it never officially supported? </p>
<p>For youngsters and Westerners: the <a href="http://en.wikipedia.org/wiki/Robotron_KC_85">KC85 home computer line</a> was built in the 80’s in East Germany. The most popular version, the KC85/3, had a 1.75MHz Z80-compatible CPU and 16kByte each of general RAM, video RAM and ROM (so 32kByte RAM and 16kByte ROM). The ROM was split in half: 8kByte BASIC, 8kByte OS. The display was 320x256 pixels, and a block of 8x4 pixels could have 1-out-of-16 foreground and 1-out-of-8 background colors. No sprite support, no dedicated sound chip, and the video RAM layout was extra-funky and had very slow CPU access.</p>
<p><a href="https://github.com/mamedev/mame">MAME/MESS</a> has rudimentary support for the KC85 line (and many other computers built behind the Iron Curtain) and I dabbled with the KC85 emulation in JSMESS a while ago, as can be seen here: <a href="http://www.flohofwoe.net/history.html">http://www.flohofwoe.net/history.html</a>. So far this dabbling was all about running old games on old (emulated) machines.</p>
<h3 id="new-code-on-old-machines">New Code on Old Machines</h3>
<p>But what about running new code on old machines? And not just Z80 assembler code, but ‘modern’ C99 code? </p>
<p>Good 8-bit C compilers are surprisingly easy to find, since the Z80 lived on well into the 2000s for embedded systems. I first started looking for a Z80 LLVM backend, but after some more googling I decided to go for <a href="http://sdcc.sourceforge.net/">SDCC</a>, which looks like the ‘industry standard’ for 8-bit CPUs and is still actively developed.</p>
<p>On OSX, a recent SDCC can be installed with brew:</p>
<pre class="prettyprint"><code class=" hljs markdown"><span class="hljs-blockquote">> brew install sdcc</span></code></pre>
<p>After playing with the compiler for a few minutes I decided that starting right with C was a few steps too far.</p>
<h3 id="mess">MESS</h3>
<p>First I had to get MESS running again. MESS is the son of MAME, focusing on vintage computer emulation instead of arcade machines. Since I last used it, MESS has been merged back into MAME, and development has moved to github: <a href="https://github.com/mamedev/mame">https://github.com/mamedev/mame</a></p>
<p>So first, git-clone and compile mess:</p>
<pre class="prettyprint"><code class=" hljs markdown"><span class="hljs-blockquote">> git clone git@github.com:mamedev/mame.git mame</span>
<span class="hljs-blockquote">> cd mame</span>
<span class="hljs-blockquote">> make TARGET=mess</span></code></pre>
<p>This produces a ‘mess64’ executable on OSX. Next, KC85/3 and /4 system ROMs are needed; these can be found by googling for ‘kc85_3.zip MESS’ (for what it’s worth, I consider these ROMs abandonware). With the compiled mess and the ROMs, a KC85/3 session can now be started in MESS:</p>
<pre class="prettyprint"><code class=" hljs lasso"><span class="hljs-subst">></span><span class="hljs-built_in">.</span>/mess64 kc85_3 <span class="hljs-attribute">-rompath</span> <span class="hljs-built_in">.</span> <span class="hljs-attribute">-window</span> <span class="hljs-attribute">-resolution</span> <span class="hljs-number">640</span>x512</code></pre>
<p>And here we go: <br>
<img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhpr3-9jqH6Vgju0ykNzjlBLIXnE2aZDedPJYZlznkfucXFR7p9C-aoXF3YjJHMeggZbNu5rMd-4DwubP2GMytJ6pjfHyrh8JPiV4CncznDCNZZYa27edupzrmV7I8GQ7ZjMT9Q2pT1Eye1/s0/mess_kc85_3.png" alt="enter image description here" title="mess_kc85_3.png"></p>
<h3 id="getting-stuff-into-mess">Getting stuff into MESS</h3>
<p>Next we need to figure out how to get code onto the emulator. The KC85 operating system ‘CAOS’ (<strong>C</strong>assette <strong>A</strong>ided <strong>O</strong>perating <strong>S</strong>ystem - yes, even East-German engineers had a sense of humor) didn’t have an ‘executable format’ like ELF; instead, raw chunks of code and data were loaded from cassette tapes into memory. There was, however, a standardised format for how the data was stored on tape: it was divided into chunks of 128 bytes, with the first chunk being a header describing at which address to load the following data. This tape format has survived as the ‘KCC file format’, where the first 128-byte chunk looks like this (taken from the <a href="https://github.com/mamedev/mame/blob/master/src/mess/machine/kc.c#L10">kc85.c MESS driver source code</a>):</p>
<pre class="prettyprint"><code class=" hljs scss">struct kcc_header
{
UINT8 name<span class="hljs-attr_selector">[10]</span>;
UINT8 reserved<span class="hljs-attr_selector">[6]</span>;
UINT8 number_addresses;
UINT8 load_address_l;
UINT8 load_address_h;
UINT8 end_address_l;
UINT8 end_address_h;
UINT8 execution_address_l;
UINT8 execution_address_h;
UINT8 pad<span class="hljs-attr_selector">[128-2-2-2-1-16]</span>;
};</code></pre>
<p>A .KCC file can be loaded into MESS using the <strong>-quik</strong> command line arg, e.g.:</p>
<pre class="prettyprint"><code class=" hljs lasso"><span class="hljs-subst">></span><span class="hljs-built_in">.</span>/mess64 kc85_3 <span class="hljs-attribute">-quik</span> test<span class="hljs-built_in">.</span>kcc <span class="hljs-attribute">-rompath</span> <span class="hljs-built_in">.</span> <span class="hljs-attribute">-window</span> <span class="hljs-attribute">-resolution</span> <span class="hljs-number">640</span>x512</code></pre>
<p>So if we had a piece of KC85/3 compatible machine code, and put it into a file with a 128-byte KCC header in front, we should be able to load this into the emulator.</p>
<p>The canonical ‘Hello World’ program for the KC85/3 looks like this in Z80 machine code:</p>
<pre class="prettyprint"><code class=" hljs bash"><span class="hljs-number">0</span>x7F <span class="hljs-number">0</span>x7F <span class="hljs-string">'HELLO'</span> <span class="hljs-number">0</span>x01
<span class="hljs-number">0</span>xCD <span class="hljs-number">0</span>x03 <span class="hljs-number">0</span>xF0
<span class="hljs-number">0</span>x23
<span class="hljs-string">'Hello World\n\r'</span> <span class="hljs-number">0</span>x00
<span class="hljs-number">0</span>xC9</code></pre>
<p>That’s a complete ‘Hello World’ in 27 bytes! Put these bytes somewhere in the KC85’s RAM, and after executing the command ‘MENU’ a new menu entry will show up named ‘HELLO’. To execute the program, type ‘HELLO’ and hit Enter: <br>
<img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEikjjC2ywoJXt1h5dwbn796d-VJkzXlTwKzpix6owo-f4LsedE6PhpIaOL5aJq5I-bu4EAYmWXXP2K4fJNkrpI-L6D9F20RiO1XUX9pF0dZTeBGSm-pCAnL8IB9V1RlLT1G5uGpEuXzbcEr/s0/kc85_hello.png" alt="enter image description here" title="kc85_hello.png"></p>
<p>How does this magic work? At the start is a special ‘7F 7F’ header which identifies these 27 bytes as a command line program called ‘HELLO’:</p>
<pre class="prettyprint"><code class=" hljs bash"><span class="hljs-number">0</span>x7F <span class="hljs-number">0</span>x7F <span class="hljs-string">'HELLO'</span> <span class="hljs-number">0</span>x01</code></pre>
<p>Execution starts right after the 0x01 byte:</p>
<pre class="prettyprint"><code class=" hljs ">0xCD 0x03 0xF0
0x23</code></pre>
<p>The <strong>CD</strong> is the machine code of the Z80 subroutine-call instruction, followed by the call-target address <strong>0xF003</strong> (the Z80 is little-endian, like the x86). This is a call to a central operating system ‘jump vector’. The following <strong>0x23</strong> byte identifies the operating system function, in this case <strong>OSTR</strong> for ‘Output STRing’ (<a href="http://www.mpm-kc85.de/dokupack/KC85_3_uebersicht.pdf">see page 43 of the system manual</a>). This function outputs a string at the current cursor position. The string is not provided as a pointer, but directly embedded into the code after the call and terminated with a zero byte:</p>
<pre class="prettyprint"><code class=" hljs tex">'Hello World<span class="hljs-command">\n</span><span class="hljs-command">\r</span>' 0x00</code></pre>
<p>After the operating system function has executed, it will resume execution after the string’s 0-terminator byte.</p>
<p>The final <strong>C9</strong> byte is the Z80 RETurn statement, which will give control back to the operating system.</p>
<p>This was the point where I started to write a bit of Python code which takes a chunk of Z80 code, puts a KCC header in front and writes it to a .kcc file. And indeed, the MESS loader accepted such a self-made ‘executable’ without problems.</p>
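<p>For illustration, here’s a minimal C sketch of what such a tool has to compute. The struct mirrors the kcc_header definition above (with standard stdint types instead of MAME’s UINT8); the helper function name and the example addresses are my own, not part of any existing tool:</p>

```c
#include <stdint.h>
#include <string.h>

/* same layout as the kcc_header struct from the MESS driver above */
struct kcc_header {
    uint8_t name[10];
    uint8_t reserved[6];
    uint8_t number_addresses;
    uint8_t load_address_l, load_address_h;
    uint8_t end_address_l, end_address_h;
    uint8_t execution_address_l, execution_address_h;
    uint8_t pad[128 - 2 - 2 - 2 - 1 - 16];
};

/* hypothetical helper: fill a KCC header for a code blob that should be
   loaded at 'load' and is 'num_bytes' long; the 16-bit addresses are
   stored as little-endian low/high byte pairs */
static void kcc_fill_header(struct kcc_header* hdr, const char* name,
                            uint16_t load, uint16_t num_bytes, uint16_t exec) {
    const uint16_t end = load + num_bytes;
    memset(hdr, 0, sizeof(*hdr));
    strncpy((char*)hdr->name, name, sizeof(hdr->name));
    hdr->number_addresses    = 3;   /* load-, end- and exec-address are valid */
    hdr->load_address_l      = load & 0xFF;
    hdr->load_address_h      = load >> 8;
    hdr->end_address_l       = end & 0xFF;
    hdr->end_address_h       = end >> 8;
    hdr->execution_address_l = exec & 0xFF;
    hdr->execution_address_h = exec >> 8;
}
```

<p>The 128-byte header is simply written to the output file, followed by the raw code bytes. For the 27-byte hello-world blob loaded at 0x200, the end address would be 0x21B; as far as I can tell from the MESS driver, the execution address is only used when number_addresses is at least 3.</p>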
<h3 id="mnemonics">Mnemonics</h3>
<p>Before tackling the C programming challenge I decided to start smaller, with Z80 assembly code. The SDCC compiler comes (among others) with a Z80 assembler, but I found this hard to use (for instance, it generates intermediate ASCII <a href="http://en.wikipedia.org/wiki/Intel_HEX">Intel HEX</a> files instead of raw binary files).</p>
<p>After some more googling I found <a href="http://www.nongnu.org/z80asm/">z80asm</a> which looked solid and easy to use. Again this can be installed via brew:</p>
<pre class="prettyprint"><code class=" hljs markdown"><span class="hljs-blockquote">> brew install z80asm</span></code></pre>
<p>The simple Hello World machine code blob from above looks like this in Z80 assembly mnemonics:</p>
<pre class="prettyprint"><code class=" hljs">    org 0x200   ; start at address 0x200
    db 0x7F,0x7F,"HELLO",1
    call 0xF003
    db 0x23
    db "Hello World\r\n\0"
    ret</code></pre>
<p>Much easier to read, right? And even with comments! Running this file through z80asm yields a binary file with the exact same 27 bytes as the hand-crafted machine code version:</p>
<pre class="prettyprint"><code class=" hljs cpp">> z80asm hello.s -o hello.bin
> hexdump hello.bin
<span class="hljs-number">0000000</span> <span class="hljs-number">7f</span> <span class="hljs-number">7f</span> <span class="hljs-number">48</span> <span class="hljs-number">45</span> <span class="hljs-number">4</span>c <span class="hljs-number">4</span>c <span class="hljs-number">4f</span> <span class="hljs-number">01</span> cd <span class="hljs-number">03</span> f0 <span class="hljs-number">23</span> <span class="hljs-number">48</span> <span class="hljs-number">65</span> <span class="hljs-number">6</span>c <span class="hljs-number">6</span>c
<span class="hljs-number">0000010</span> <span class="hljs-number">6f</span> <span class="hljs-number">20</span> <span class="hljs-number">57</span> <span class="hljs-number">6f</span> <span class="hljs-number">72</span> <span class="hljs-number">6</span>c <span class="hljs-number">64</span> <span class="hljs-number">0</span>d <span class="hljs-number">0</span>a <span class="hljs-number">00</span> c9
<span class="hljs-number">000001</span>b</code></pre>
<p>With some more Python plumbing I was then able to ‘cross-assemble’ new programs for the KC85 in a modern development environment. Very cool!</p>
<h3 id="c99">C99</h3>
<p>But the real challenge remains: compiling and running C code! Compiling a C source through SDCC generates a lot of output files, but none of them is the expected binary blob of executable code:</p>
<pre class="prettyprint"><code class=" hljs avrasm">> sdcc hello<span class="hljs-preprocessor">.c</span>
> ls
hello<span class="hljs-preprocessor">.asm</span> hello<span class="hljs-preprocessor">.ihx</span> hello<span class="hljs-preprocessor">.lst</span> hello<span class="hljs-preprocessor">.mem</span> hello<span class="hljs-preprocessor">.rst</span>
hello<span class="hljs-preprocessor">.c</span> hello<span class="hljs-preprocessor">.lk</span> hello<span class="hljs-preprocessor">.map</span> hello<span class="hljs-preprocessor">.rel</span> hello<span class="hljs-preprocessor">.sym</span></code></pre>
<p>There are 2 interesting files: <strong>hello.asm</strong> is a human-readable assembler source file, and <strong>hello.ihx</strong> is the final executable, but in Intel HEX format. The .ihx file can be converted into a raw binary blob using the <strong>makebin</strong> program that also comes with SDCC.</p>
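<p>The conversion looks roughly like this (the exact makebin options may differ between SDCC versions; <strong>-p</strong> should trim the output to the highest address actually written, instead of emitting a full-sized memory image):</p>
<pre class="prettyprint"><code class=" hljs">> sdcc -mz80 hello.c
> makebin -p hello.ihx hello.bin</code></pre>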
<p>But even with a very simple C program there are already a few things off:</p>
<ul>
<li>global variables are placed at address 0x8000 (32kBytes into the address space); on the KC85/3 this is video memory, so the default address for data wouldn’t work</li>
<li>if any global variables are initialized, the resulting binary file is also at least 32kBytes big, and has a lot of empty space inside</li>
<li>there’s a few dozen bytes of runtime initialization code which isn’t needed in the KC85 environment (at least as long as we don’t want to use the C runtime)</li>
</ul>
<p>Thankfully, SDCC allows all of this to be tweaked, and can compile (and link) pieces of C code into raw blobs of machine code without any ‘runtime overhead’; it doesn’t even need a main function to produce a valid executable. </p>
<p>Currently I’m placing global data at address 0x200 and code at address 0x300 (so there are 256 bytes for global data), and I’m disabling anything C-runtime related. And of course we need to tell the compiler to generate Z80 code. These are the important command line options for sdcc:</p>
<pre class="prettyprint"><code class=" hljs">-mz80
--no-std-crt0 --nostdinc --nostdlib
--code-loc 0x300
--data-loc 0x200</code></pre>
<p>With these compiler settings I’m getting the bare-bones Z80 code I want on the KC85. All that’s left now is some macros and system call wrapper functions to provide a KC-style runtime environment, and TADAA:</p>
<p>C99 programming on a 30 year old 8-bit home computer :D</p>
<iframe width="560" height="315" src="//floooh.github.io/kc85sdk/kc85_c.webm" allowfullscreen=""></iframe>
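<p>To give an idea of what such a wrapper can look like: this is an untested sketch (not the actual kc85sdk code) of a ‘HELLO’ command line program written with SDCC’s inline assembly, using the 0x7F 0x7F prologue and the OSTR system call described earlier:</p>
<pre class="prettyprint"><code class=" hljs">// untested sketch, not the actual kc85sdk code: emit the command
// prologue and an OSTR call via SDCC inline assembly
void hello(void) __naked {
    __asm
        .db  0x7F, 0x7F        ; command prologue
        .ascii "HELLO"
        .db  0x01              ; execution starts after this byte
        call 0xF003            ; CAOS jump vector
        .db  0x23              ; function code: OSTR
        .ascii "Hello World"
        .db  0x0D, 0x0A, 0x00
        ret
    __endasm;
}</code></pre>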
<p>Here’s the github link to the ‘kc85sdk’ (work in progress):</p>
<p><a href="https://github.com/floooh/kc85sdk">https://github.com/floooh/kc85sdk</a></p>
<blockquote>
<p>Written with <a href="https://stackedit.io/">StackEdit</a>.</p>
</blockquote>

<h2>Cross-Platform Multitouch Input</h2>
<p><em>2014-10-08</em></p>
<p><strong>TL;DR</strong>: A look at the low-level touch-input APIs on iOS, Android NDK and emscripten, and how to unify them for cross-platform engines, with links to source code.</p>
<h3 id="why">Why</h3>
<p>Compared to mouse, keyboard and gamepad, handling multi-touch input is a complex topic because it usually involves gesture recognition, at least for simple gestures like tapping, panning and pinching. When I worked on mobile platforms in the past, I usually tried to avoid processing low-level touch input events directly, and instead used gesture recognizers provided by the platform SDKs:</p>
<p>On <strong>iOS</strong>, gesture recognizers are provided by UIKit; they are attached to a UIView object, and when a gesture recognizer detects a gesture it invokes a callback method. The details are here: <a href="https://developer.apple.com/library/ios/documentation/EventHandling/Conceptual/EventHandlingiPhoneOS/GestureRecognizer_basics/GestureRecognizer_basics.html">GestureRecognizer_basics.html</a></p>
<p>The <strong>Android NDK</strong> itself has no built-in gesture recognizers, but comes with source code for a few simple gesture detectors in the <a href="https://android.googlesource.com/platform/development/+/master/ndk/sources/android/ndk_helper/">ndk_helper source code directory</a>.</p>
<p>There are 2 problems with using SDK-provided gesture detectors. First, iOS and Android detectors behave differently: a pinch in the Android NDK is something slightly different than a pinch in the iOS SDK. And second, the <strong>emscripten SDK</strong> only provides the low-level touch events as defined by the <a href="http://www.w3.org/TR/touch-events/">HTML5 Touch Event API</a>, no high-level gesture recognizers.</p>
<p>So, to handle all 3 platforms in a common way, there doesn’t seem to be a way around writing your own gesture recognizers and trying to reduce the platform-specific touch event information into a platform-agnostic common subset.</p>
<h3 id="platform-specific-touch-events">Platform-specific touch events</h3>
<p>Let’s first look at the low-level touch events provided by each platform in order to merge their common attributes into a generic touch event:</p>
<h4 id="ios-touch-events">iOS touch events</h4>
<p>On iOS, touch events are forwarded to <strong>UIView</strong> callback methods (more specifically, <strong>UIResponder</strong>, which is a parent class of UIView). Multi-touch is disabled by default and must be enabled first by setting the property <strong>multipleTouchEnabled</strong> to YES.</p>
<p>The callback methods are:</p>
<pre class="prettyprint"><code class=" hljs">- touchesBegan:withEvent:
- touchesMoved:withEvent:
- touchesEnded:withEvent:
- touchesCancelled:withEvent:</code></pre>
<p>All methods get an <strong>NSSet of UITouch objects</strong> as first argument and a <strong>UIEvent</strong> as second argument.</p>
<p>The arguments are a bit non-obvious: the set of UITouches in the first argument is not the overall number of current touches, but only the touches that have <em>changed their state</em>. So if there’s already 2 fingers down, and a 3rd finger touches the display, a <strong>touchesBegan</strong> will be received with a <strong>single UITouch object</strong> in the NSSet argument, which describes the touch of the 3rd finger that just came down. Same with <strong>touchesEnded</strong> and <strong>touchesMoved</strong>: if one of 3 fingers goes up (or moves), the NSSet will only contain a single UITouch object for the finger that has changed its state.</p>
<p>The <em>overall</em> number of current touches is contained in the UIEvent object, so if 3 fingers are down, the UIEvent object contains 3 UITouch objects. The 4 callback methods and the NSSet argument are actually redundant, since all that information is also contained in the UIEvent object. A single <em>touchesChanged</em> callback method with a single UIEvent argument would have been enough to communicate the same information.</p>
<p>Let’s have a <a href="https://developer.apple.com/library/ios/Documentation/UIKit/Reference/UIEvent_Class/index.html#//apple_ref/c/tdef/UIEventType">look at the information</a> provided by UIEvent: first there’s the method <strong>allTouches</strong>, which returns an NSSet of all UITouch objects in the event, and there’s a <strong>timestamp</strong> for when the event occurred. The rest is contained in the returned <a href="https://developer.apple.com/library/ios/Documentation/UIKit/Reference/UITouch_Class/index.html#//apple_ref/occ/cl/UITouch">UITouch objects</a>:</p>
<p>The UITouch method <strong>locationInView</strong> provides the position of the touch, the <strong>phase</strong> value gives the current state of the touch (began, moved, stationary, ended, cancelled). The rest is not really needed or specific to the iOS platform.</p>
<h4 id="android-ndk-touch-events">Android NDK touch events</h4>
<p>On Android, I assume that the Native Activity is used, with the <a href="https://android.googlesource.com/platform/development/+/master/ndk/sources/android/native_app_glue/android_native_app_glue.h">android_native_app_glue.h</a> helper classes. The application wrapper class <strong>android_app</strong> allows setting a single input event callback function which is called whenever an input event occurs. Android NDK input events and access functions are defined in the “android/input.h” header. The input event struct <strong>AInputEvent</strong> itself isn’t public and can only be accessed through accessor functions defined in the same header.</p>
<p>When an input event arrives at the user-defined callback function, first check whether it is actually a touch event:</p>
<pre class="prettyprint"><code class=" hljs">int32_t type = AInputEvent_getType(aEvent);
if (AINPUT_EVENT_TYPE_MOTION == type) {
    // yep, a touch event
}</code></pre>
<p>Once it’s sure that the event is a touch event, the <strong>AMotionEvent_</strong> set of accessor functions must be used to extract the rest of the information. There’s a whole lot of them, but we’re only interested in the attributes that are also provided by other platforms:</p>
<pre class="prettyprint"><code class=" hljs scss"><span class="hljs-function">AMotionEvent_getAction()</span>;
<span class="hljs-function">AMotionEvent_getEventTime()</span>;
<span class="hljs-function">AMotionEvent_getPointerCount()</span>;
<span class="hljs-function">AMotionEvent_getPointerId()</span>;
<span class="hljs-function">AMotionEvent_getX()</span>;
<span class="hljs-function">AMotionEvent_getY()</span>;</code></pre>
<p>Together, these functions provide the same information as the iOS UIEvent object, but the information is harder to extract.</p>
<p>Let’s start with the simple stuff: A motion event contains an array of touch points, called ‘pointers’, one for each finger touching the display. The number of touch points is returned by the <strong>AMotionEvent_getPointerCount()</strong> function, which takes an AInputEvent* as argument. The accessor functions <strong>AMotionEvent_getPointerId()</strong>, <strong>AMotionEvent_getX()</strong> and <strong>AMotionEvent_getY()</strong> take an AInputEvent* and an index to acquire an attribute of the touch point at the specified index. AMotionEvent_getX()/getY() extract the X/Y position of the touch point, and the AMotionEvent_getPointerId() function returns a unique id which is required to track the same touch point across several input events.</p>
<p><strong>AMotionEvent_getAction()</strong> provides 2 pieces of information in a single return value: the actual ‘action’, and the index of the touch point this action applies to:</p>
<p>The lower 8 bits of the return value contain the action code for a touch point that has changed state (whether a touch has started, moved, ended or was cancelled):</p>
<pre class="prettyprint"><code class=" hljs ">AMOTION_EVENT_ACTION_DOWN
AMOTION_EVENT_ACTION_UP
AMOTION_EVENT_ACTION_MOVE
AMOTION_EVENT_ACTION_CANCEL
AMOTION_EVENT_ACTION_POINTER_DOWN
AMOTION_EVENT_ACTION_POINTER_UP</code></pre>
<p>Note that there are 2 down events, DOWN and POINTER_DOWN. The NDK differentiates between ‘primary’ and ‘non-primary pointers’. The first finger down generates a DOWN event, the following fingers POINTER_DOWN events. I haven’t found a reason why these should be handled differently, so both DOWN and POINTER_DOWN events are handled the same in my code.</p>
<p>The upper 24 bits contain the index (not the identifier!) of the touch point that has changed its state.</p>
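<p>Decoding this packed value can be sketched like this; the mask and shift values below mirror the AMOTION_EVENT_ACTION_MASK, AMOTION_EVENT_ACTION_POINTER_INDEX_MASK and AMOTION_EVENT_ACTION_POINTER_INDEX_SHIFT constants from “android/input.h”, redefined here only to keep the snippet self-contained:</p>

```c
#include <stdint.h>

/* values mirror the AMOTION_EVENT_ACTION_* constants in android/input.h */
enum {
    ACTION_MASK                = 0x00ff,
    ACTION_POINTER_INDEX_MASK  = 0xff00,
    ACTION_POINTER_INDEX_SHIFT = 8
};

/* split the packed AMotionEvent_getAction() return value into the action
   code (lower 8 bits) and the index of the changed touch point (upper bits) */
static void decode_action(int32_t action, int32_t* action_code, int32_t* pointer_index) {
    *action_code   = action & ACTION_MASK;
    *pointer_index = (action & ACTION_POINTER_INDEX_MASK) >> ACTION_POINTER_INDEX_SHIFT;
}
```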
<h4 id="emscripten-sdk-touch-events">emscripten SDK touch events</h4>
<p>Touch input in emscripten is provided by the new HTML5 wrapper API in the ‘emscripten/html5.h’ header, which allows setting callback functions for nearly all types of HTML5 events (the complete API documentation <a href="http://kripken.github.io/emscripten-site/docs/api_reference/html5.h.html">can be found here</a>).</p>
<p>To receive touch-events, the following 4 functions are relevant:</p>
<pre class="prettyprint"><code class=" hljs bash">emscripten_<span class="hljs-keyword">set</span>_touchstart_callback()
emscripten_<span class="hljs-keyword">set</span>_touchend_callback()
emscripten_<span class="hljs-keyword">set</span>_touchmove_callback()
emscripten_<span class="hljs-keyword">set</span>_touchcancel_callback()</code></pre>
<p>These set the application-provided callback functions that are called when a touch event occurs.</p>
<p>There’s a caveat when handling touch input in the browser: usually a browser application doesn’t start in fullscreen mode, and the browser itself uses gestures for navigation (like scrolling, page-back and page-forward). The emscripten API allows restricting the events to specific DOM elements (for instance the WebGL canvas of the application instead of the whole HTML document), and the callback can decide to ‘swallow’ the event so that standard handling by the browser is suppressed.</p>
<p>The first argument to the callback setter functions above is a C-string pointer identifying the DOM element. If this is a null pointer, events from the whole webpage will be received. The most useful value is “#canvas”, which limits the events to the (WebGL) canvas managed by the emscripten app.</p>
<p>In order to suppress default handling of an event, the event callback function should return ‘true’ (or ‘false’ if default handling should happen, but this is usually not desired, at least for games).</p>
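<p>Putting the registration together might look like this (a sketch: the handler name <em>touch_cb</em> and the setup function are my own, not part of the API; all 4 event types are routed into the same handler):</p>
<pre class="prettyprint"><code class=" hljs">#include &lt;emscripten/html5.h&gt;

// sketch: one handler for all 4 touch event types
static EM_BOOL touch_cb(int eventType, const EmscriptenTouchEvent* e, void* userData) {
    // convert e->touches[0 .. e->numTouches-1] into generic touch events here...
    return EM_TRUE;     // swallow the event, suppress default browser handling
}

static void setup_touch_callbacks(void* userData) {
    // "#canvas": only receive events from the WebGL canvas
    emscripten_set_touchstart_callback("#canvas", userData, EM_TRUE, touch_cb);
    emscripten_set_touchend_callback("#canvas", userData, EM_TRUE, touch_cb);
    emscripten_set_touchmove_callback("#canvas", userData, EM_TRUE, touch_cb);
    emscripten_set_touchcancel_callback("#canvas", userData, EM_TRUE, touch_cb);
}</code></pre>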
<p>The touch event callback function is called with the following arguments:</p>
<pre class="prettyprint"><code class=" hljs">int eventType,
const EmscriptenTouchEvent* event,
void* userData</code></pre>
<p><strong>eventType</strong> will be one of:</p>
<pre class="prettyprint"><code class=" hljs ">EMSCRIPTEN_EVENT_TOUCHSTART
EMSCRIPTEN_EVENT_TOUCHEND
EMSCRIPTEN_EVENT_TOUCHMOVE
EMSCRIPTEN_EVENT_TOUCHCANCEL</code></pre>
<p>The 4 different callbacks are again kind of redundant (like on iOS); it often makes sense to route all 4 callbacks to the same handler function and differentiate there through the eventType argument.</p>
<p>The actual touch event data is contained in the EmscriptenTouchEvent structure; interesting for us are the member <strong>int numTouches</strong> and an array of <strong>EmscriptenTouchPoint</strong> structs. A single EmscriptenTouchPoint has the fields <strong>identifier</strong>, <strong>isChanged</strong> and the position of the touch in <strong>canvasX, canvasY</strong> (other members omitted for clarity).</p>
<p>Except for the timestamp of the event, this is the same information provided by the iOS and Android NDK touch APIs.</p>
<h4 id="bringing-it-all-together">Bringing it all together</h4>
<p>The cross-section of all 3 touch APIs provides the following information:</p>
<ul>
<li>a notification when the touch state changes: <br>
<ul><li>a touch-down was detected (a new finger touches the display)</li>
<li>a touch-up was detected (a finger was lifted off the display)</li>
<li>a movement was detected</li>
<li>a cancellation was detected</li></ul></li>
<li>information about all current touch points, and which of them has changed state <br>
<ul><li>the x,y position of the touch</li>
<li>a unique identifier in order to track the same touch point over several input events</li></ul></li>
</ul>
<p>The touch point identifier is a bit non-obvious in the iOS API, since the UITouch class doesn’t have an identifier member. On iOS, the pointer to a UITouch object serves as the identifier; the same UITouch object is guaranteed to exist as long as the touch is active.</p>
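<p>One simple way to track the same touch across events on all 3 platforms is a small identifier-to-slot map; this is a hypothetical sketch, not the actual Oryol code (a free slot is marked with -1, since 0 is a valid Android pointer id):</p>

```c
#include <stdint.h>

#define MAX_TOUCHES (8)

/* maps platform touch identifiers to stable local slot indices */
typedef struct {
    intptr_t ids[MAX_TOUCHES];      /* -1 marks a free slot */
} touch_tracker;

static void tracker_init(touch_tracker* t) {
    for (int i = 0; i < MAX_TOUCHES; i++) {
        t->ids[i] = -1;
    }
}

/* return the slot for 'id', allocating a free slot for new identifiers;
   returns -1 if all slots are taken */
static int tracker_slot(touch_tracker* t, intptr_t id) {
    int free_slot = -1;
    for (int i = 0; i < MAX_TOUCHES; i++) {
        if (t->ids[i] == id) {
            return i;                   /* identifier already tracked */
        }
        if ((t->ids[i] == -1) && (free_slot == -1)) {
            free_slot = i;
        }
    }
    if (free_slot != -1) {
        t->ids[free_slot] = id;         /* start tracking a new touch */
    }
    return free_slot;
}

/* call on touch-up or cancel so the slot can be reused */
static void tracker_release(touch_tracker* t, intptr_t id) {
    for (int i = 0; i < MAX_TOUCHES; i++) {
        if (t->ids[i] == id) {
            t->ids[i] = -1;
        }
    }
}
```

<p>With this, a finger keeps the same slot index from touch-down to touch-up, no matter whether the platform hands over a pointer value (iOS), an integer id (Android) or an identifier field (emscripten).</p>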
<p>Also, another crucial piece of information is the timestamp when the event occurred. iOS and Android NDK provide this with their touch events, but not the emscripten SDK. Since the timestamps on Android and iOS have different meaning anyway, I’m simply tracking my own time when the events are received.</p>
<p>My unified, platform-agnostic <strong>touchEvent</strong> now basically looks like this:</p>
<pre class="prettyprint"><code class=" hljs">struct touchEvent {
    enum touchType {
        began,
        moved,
        ended,
        cancelled,
        invalid,
    } type = invalid;
    TimePoint time;
    int32 numTouches = 0;
    static const int32 MaxNumPoints = 8;
    struct point {
        uintptr identifier = 0;
        glm::vec2 pos;
        bool isChanged = false;
    } points[MaxNumPoints];
};</code></pre>
<p><strong>TimePoint</strong> is an Oryol-style timestamp object. The <strong>uintptr</strong> datatype for the identifier is an unsigned integer with the size of a pointer (32- or 64-bit depending on platform).</p>
<p>Platform-specific touch events are received, converted to generic touch events, and then fed into custom gesture recognizers:</p>
<ul>
<li><a href="https://github.com/floooh/oryol/blob/d640cf7840cabe866e34290e3a00c4309cc198a3/code/Modules/Input/ios/iosInputMgr.mm">iOS touch event source code</a></li>
<li><a href="https://github.com/floooh/oryol/blob/d640cf7840cabe866e34290e3a00c4309cc198a3/code/Modules/Input/android/androidInputMgr.cc">Android touch event source code</a></li>
<li><a href="https://github.com/floooh/oryol/blob/d640cf7840cabe866e34290e3a00c4309cc198a3/code/Modules/Input/emsc/emscInputMgr.cc">emscripten touch event source code (plus mouse and keyboard input handling)</a></li>
</ul>
<p>Simple gesture detector source code:</p>
<ul>
<li><a href="https://github.com/floooh/oryol/blob/d640cf7840cabe866e34290e3a00c4309cc198a3/code/Modules/Input/touch/tapDetector.cc">tap detector</a></li>
<li><a href="https://github.com/floooh/oryol/blob/d640cf7840cabe866e34290e3a00c4309cc198a3/code/Modules/Input/touch/panDetector.cc">panning detector</a></li>
<li><a href="https://github.com/floooh/oryol/blob/d640cf7840cabe866e34290e3a00c4309cc198a3/code/Modules/Input/touch/pinchDetector.cc">pinch detector</a></li>
</ul>
<p>And a simple demo (the WebGL version has only been tested on iOS8; mobile Safari’s WebGL implementation still has bugs):</p>
<ul>
<li><a href="http://floooh.github.io/oryol/TestInput.html">WebGL demo</a></li>
<li><a href="http://floooh.github.io/oryol/TestInput-debug.apk">Android self-signed APK</a></li>
</ul>
<p>And that’s all for today :)</p>
<blockquote>
<p>Written with <a href="https://stackedit.io/">StackEdit</a>.</p>
</blockquote>Unknownnoreply@blogger.comtag:blogger.com,1999:blog-2948438400037317662.post-19381447989174549352014-05-24T21:02:00.001+01:002014-05-24T21:14:15.974+01:00Shader Compilation and IDEs<p>I recently played around with shader code generation and the GLSL reference compiler in <a href="https://www.github.com/floooh/oryol">Oryol</a>.</p>
<p>The result is IMHO pretty neat:</p>
<p>Shader source files (*.shd) live in the IDE next to C++ files: <br>
<img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgIlK6jfDB27qCEK1JJMdWzoa25u310D2TysvqnjTqYRuQ6l5C54XO06KnqGdUdUUQPuW5fDosyd_MZcwljQN-lYBz6FXtmHNYMC3intkjqIfuOyM56jKu6LdWniAzpVtwVUmGqeUk7suUh/s0/shd1.png" alt="enter image description here" title="shd1.png"></p>
<p>Shader files are written in normal GLSL syntax with custom annotations (those @ and $ tags): <br>
<img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYfa5UhnBVcHuqvzry51xErNt4PLgcFtctE8b2Z75PWgaZGO58E17SuMlDw8nKRue5PNCWqCu25K8JifZ4KCogh3k53yCG1RYL3YJETdS_O07D54kQs0Mvmyrhpg-fW2AoagI-0v_EJ7pr/s0/annotated_glsl.png" alt="enter image description here" title="annotated_glsl.png"></p>
<p>When compiling the project, a custom build step will generate vertex- and fragment shaders for different GLSL versions and run them through the GLSL reference compiler. Any errors from the reference compiler are converted to a format which can be parsed by the IDE:</p>
<p><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgyNKDvycB94jKfj-vUXXnK37ZEbPSPmmbaoXSlZFWL_d9kG2XGsK8tKlKzEWoGxkXir4HCrHapRYzVOKlMoMbVbxITrvpmswaFeGogjX6p3dF7eNV5Bae-OBP7WOCT0Sb3HF42GKyc8OS9/s0/xcode_error.png" alt="enter image description here" title="xcode_error.png"></p>
<p>Error parsing also works in Visual Studio: <br>
<img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhxVzArIVlv7l2RdCfEcyM95NXUy9sDsCV6rqNl8TqqouN4U2MhyJVkE6arxli3S4Be4zX8JIg_z_3rtc1-vNFX6wTUGAeeKzWYW5L7lhEe23VS68Nr23uGCg5V4yHqeEyo5i7V0C19SIlN/s0/shd_error_vstudio.png" alt="enter image description here" title="shd_error_vstudio.png"></p>
<p>Unfortunately I couldn’t get error parsing to work in QtCreator on Linux. The error messages are recognised, but double-clicking them doesn’t work.</p>
<p>After the GLSL compiler pass, a C++ header/source file pair will be created which contains the GLSL shader code and some C++ glue to make the shader accessible from the engine side.</p>
<p>The edit-compile-test cycle is only one or two seconds, depending on the link time of the demo code. Also, since the shader generation runs as a normal build step, shader code will also be generated and validated in command line builds.</p>
<h4 id="heres-how-it-works">Here’s how it works:</h4>
<p>When <strong>cmake</strong> runs to create the build files it will look for XML files in the source code directories. For each XML file, a custom build target will be created which invokes a python script. This ‘generator script’ will generate a C++ header/source pair during compilation.</p>
<p>This generic code generation has only been used so far for the Oryol Messaging system, but it is flexible enough to cover other code generation scenarios (like generating shader code).</p>
<p>Setting up the custom build target involves 3 steps:</p>
<p>The actual build target must be created, cmake has the <strong>add_custom_target</strong> macro for this:</p>
<pre class="prettyprint prettyprinted"><code><span class="pln">add_custom_target</span><span class="pun">(</span><span class="pln">$</span><span class="pun">{</span><span class="pln">target</span><span class="pun">}</span><span class="pln">_gen
COMMAND $</span><span class="pun">{</span><span class="pln">PYTHON</span><span class="pun">}</span><span class="pln">
$</span><span class="pun">{</span><span class="pln">ORYOL_ROOT_DIR</span><span class="pun">}/</span><span class="pln">generators</span><span class="pun">/</span><span class="pln">generator</span><span class="pun">.</span><span class="pln">py
$</span><span class="pun">{</span><span class="pln">xmlFiles</span><span class="pun">}</span><span class="pln">
COMMENT </span><span class="str">"Generating sources..."</span><span class="pun">)</span></code></pre>
<p>This statement takes a variable <em>target</em> with the name of the build target which will compile the generated C++ sources, plus an <em>xmlFiles</em> list variable, and generates a new build target called [<em>target</em>]_gen. The variables PYTHON and ORYOL_ROOT_DIR are config variables pointing to the python executable and the Oryol root directory.</p>
<p>To get the right build order, a target dependency must be defined so that the generated target is always run before the build target which needs the generated C++ source code:</p>
<pre class="prettyprint prettyprinted"><code><span class="pln">add_dependencies</span><span class="pun">(</span><span class="pln">$</span><span class="pun">{</span><span class="pln">target</span><span class="pun">}</span><span class="pln"> $</span><span class="pun">{</span><span class="pln">target</span><span class="pun">}</span><span class="pln">_gen</span><span class="pun">)</span></code></pre>
<p>Finally we need to resolve a chicken-egg situation. All C++ files must exist when cmake assembles the build files, but the generated C++ files will only be created during the first build. To fix this situation, empty placeholder files are created if the generated sources don’t exist yet: </p>
<pre class="prettyprint prettyprinted"><code><span class="kwd">foreach</span><span class="pun">(</span><span class="pln">xmlFile $</span><span class="pun">{</span><span class="pln">xmlFiles</span><span class="pun">})</span><span class="pln">
</span><span class="kwd">string</span><span class="pun">(</span><span class="pln">REPLACE </span><span class="pun">.</span><span class="pln">xml </span><span class="pun">.</span><span class="pln">cc src $</span><span class="pun">{</span><span class="pln">xmlFile</span><span class="pun">})</span><span class="pln">
</span><span class="kwd">string</span><span class="pun">(</span><span class="pln">REPLACE </span><span class="pun">.</span><span class="pln">xml </span><span class="pun">.</span><span class="pln">h hdr $</span><span class="pun">{</span><span class="pln">xmlFile</span><span class="pun">})</span><span class="pln">
</span><span class="kwd">if</span><span class="pln"> </span><span class="pun">(</span><span class="pln">NOT EXISTS $</span><span class="pun">{</span><span class="pln">src</span><span class="pun">})</span><span class="pln">
file</span><span class="pun">(</span><span class="pln">WRITE $</span><span class="pun">{</span><span class="pln">src</span><span class="pun">}</span><span class="pln"> </span><span class="str">" "</span><span class="pun">)</span><span class="pln">
endif</span><span class="pun">()</span><span class="pln">
</span><span class="kwd">if</span><span class="pln"> </span><span class="pun">(</span><span class="pln">NOT EXISTS $</span><span class="pun">{</span><span class="pln">hdr</span><span class="pun">})</span><span class="pln">
file</span><span class="pun">(</span><span class="pln">WRITE $</span><span class="pun">{</span><span class="pln">hdr</span><span class="pun">}</span><span class="pln"> </span><span class="str">" "</span><span class="pun">)</span><span class="pln">
endif</span><span class="pun">()</span><span class="pln">
endforeach</span><span class="pun">()</span><span class="pln"> </span></code></pre>
<p>These 3 steps take care of the build configuration via cmake.</p>
<h4 id="on-to-the-python-generator-script">On to the python generator script:</h4>
<p>First, the generator script parses the XML ‘source file’ which caused its invocation. For the shader generator, the XML file is very simple:</p>
<pre class="prettyprint prettyprinted"><code><span class="tag"><Generator</span><span class="pln"> </span><span class="atn">type</span><span class="pun">=</span><span class="atv">"ShaderLibrary"</span><span class="pln"> </span><span class="atn">name</span><span class="pun">=</span><span class="atv">"Shaders"</span><span class="pln"> </span><span class="tag">></span><span class="pln">
</span><span class="tag"><AddDir</span><span class="pln"> </span><span class="atn">path</span><span class="pun">=</span><span class="atv">"shd"</span><span class="tag">/></span><span class="pln">
</span><span class="tag"></Generator></span></code></pre>
<p>The most important piece is the <em>AddDir</em> tag, which tells the generator script where to find the actual shader source files. More than one <em>AddDir</em> can be added if the shader sources are spread over different directories.</p>
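<p>Parsing such a file is a few lines with python’s built-in <strong>xml.etree</strong>; a hypothetical sketch (the function name is invented, the element and attribute names match the sample above):</p>

```python
import xml.etree.ElementTree as ET

# Hypothetical sketch of reading the generator XML file; element and
# attribute names match the sample above, the function is invented.
def parse_generator_xml(xml_string):
    root = ET.fromstring(xml_string)
    gen_type = root.get('type')     # e.g. 'ShaderLibrary'
    name = root.get('name')         # e.g. 'Shaders'
    dirs = [d.get('path') for d in root.findall('AddDir')]
    return gen_type, name, dirs

sample = '<Generator type="ShaderLibrary" name="Shaders"><AddDir path="shd"/></Generator>'
```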
<p>Generator scripts must also include a dirty-check and only overwrite the target C++ files when the source files (in this case: the XML file and all shader sources) are newer than the target sources, to prevent unneeded compilation of dependent files.</p>
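<p>Such a dirty-check boils down to comparing file modification times; a hypothetical Python sketch (the function name is invented, the real script’s logic may differ):</p>

```python
import os

# Hypothetical dirty-check sketch: regenerate when any source file is
# newer than any target file, or when a target is missing entirely.
def needs_regeneration(source_files, target_files):
    # missing targets always force a generator run
    if not all(os.path.exists(t) for t in target_files):
        return True
    newest_src = max(os.path.getmtime(s) for s in source_files)
    oldest_tgt = min(os.path.getmtime(t) for t in target_files)
    return newest_src > oldest_tgt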
<h4 id="shader-file-parsing">Shader File Parsing</h4>
<p>Shader files will be processed by a simple line-parser:</p>
<ol>
<li>comments and white-space will be removed</li>
<li>find and process ‘@’ and ‘$’ keywords</li>
<li>gather GLSL code lines and keep track of their source file and line numbers (this is important for mapping error messages back later)</li>
</ol>
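<p>The bookkeeping from step 3 could look roughly like this in Python (names and data layout are invented; real ‘@’ keyword processing is omitted):</p>

```python
# Hypothetical sketch of the line-parser bookkeeping: each emitted GLSL
# code line remembers its source file and line number, so that
# reference-compiler errors can be mapped back later.
def parse_shader_source(path, text):
    lines = []   # list of (code, src_path, src_line_number)
    for line_num, raw in enumerate(text.splitlines()):
        # strip '//' comments and surrounding whitespace
        code = raw.split('//')[0].strip()
        if not code:
            continue                 # drop empty/comment-only lines
        if code.startswith('@'):
            pass                     # '@' keywords would be processed here
        else:
            lines.append((code, path, line_num))
    return lines
```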
<p>A very minimal shader file looks like this:</p>
<pre class="prettyprint prettyprinted"><code><span class="lit">@vs</span><span class="pln"> </span><span class="typ">MyVertexShader</span><span class="pln">
</span><span class="lit">@uniform</span><span class="pln"> mat4 mvp </span><span class="typ">ModelViewProj</span><span class="pln">
</span><span class="lit">@in</span><span class="pln"> vec4 position
</span><span class="lit">@in</span><span class="pln"> vec2 texcoord0
</span><span class="lit">@out</span><span class="pln"> vec2 uv
</span><span class="kwd">void</span><span class="pln"> main</span><span class="pun">()</span><span class="pln"> </span><span class="pun">{</span><span class="pln">
$position </span><span class="pun">=</span><span class="pln"> mvp </span><span class="pun">*</span><span class="pln"> position</span><span class="pun">;</span><span class="pln">
uv </span><span class="pun">=</span><span class="pln"> texcoord0</span><span class="pun">;</span><span class="pln">
</span><span class="lit">@end</span><span class="pln">
</span><span class="lit">@fs</span><span class="pln"> </span><span class="typ">MyFragmentShader</span><span class="pln">
</span><span class="lit">@uniform</span><span class="pln"> sampler2D tex </span><span class="typ">Texture</span><span class="pln">
</span><span class="lit">@in</span><span class="pln"> vec2 uv
</span><span class="kwd">void</span><span class="pln"> main</span><span class="pun">()</span><span class="pln"> </span><span class="pun">{</span><span class="pln">
$color </span><span class="pun">=</span><span class="pln"> $texture2D</span><span class="pun">(</span><span class="pln">tex</span><span class="pun">,</span><span class="pln"> uv</span><span class="pun">);</span><span class="pln">
</span><span class="pun">}</span><span class="pln">
</span><span class="lit">@end</span><span class="pln">
</span><span class="lit">@bundle</span><span class="pln"> </span><span class="typ">Main</span><span class="pln">
</span><span class="lit">@program</span><span class="pln"> </span><span class="typ">MyVertexShader</span><span class="pln"> </span><span class="typ">MyFragmentShader</span><span class="pln">
</span><span class="lit">@end</span></code></pre>
<p>This defines one vertex shader (between the <strong>@vs</strong> and <strong>@end</strong> tags) and a matching fragment shader (between <strong>@fs</strong> and <strong>@end</strong>). The vertex shader defines a 4x4 matrix uniform with the GLSL variable name <em>mvp</em> and the ‘bind name’ <em>ModelViewProj</em>, and it expects position and texture coordinates from the vertex. The vertex shader transforms the vertex-position into the special variable $position and forwards the texture coordinate to the fragment shader.</p>
<p>The fragment shader defines a texture sampler uniform with the GLSL variable name <em>tex</em> and the bind name <em>Texture</em>. It takes the texture coordinates emitted by the vertex shader, samples the texture and writes the color into the special variable $color.</p>
<p>Finally a shader <strong>@bundle</strong> with the name ‘Main’ is defined, and one shader program created from the previously defined vertex- and fragment-shader is attached to the bundle. A shader bundle is an Oryol-specific concept and is simply a collection of one or more shader programs that are related to each other.</p>
<p>Two more useful tags which aren’t used in this simple example are <strong>@block</strong> and <strong>@use</strong>. A @block encapsulates a piece of code which can then be included with a @use tag in other blocks or vertex-/fragment-shaders. This is basically the missing <strong>#include</strong> mechanism for GLSL files.</p>
<p>Here’s some @block sample code: first a <em>Util</em> block is defined with general utility functions, then a block <em>VSLighting</em> which would contain lighting functions for vertex shaders, and <em>FSLighting</em> with lighting functions for fragment shaders. Both VSLighting and FSLighting want to use functions from the Util block (via <strong>@use Util</strong>). Finally, the vertex- and fragment-shaders would contain a <em>@use VSLighting</em> and <em>@use FSLighting</em> (not shown). The shader code generator then resolves all block dependencies and includes the required code blocks in the generated shader source in the right order:</p>
<pre class="prettyprint prettyprinted"><code><span class="lit">@block</span><span class="pln"> </span><span class="typ">Util</span><span class="pln">
</span><span class="com">// general utility functions</span><span class="pln">
vec4 bla</span><span class="pun">()</span><span class="pln"> </span><span class="pun">{</span><span class="pln">
vec4 result</span><span class="pun">;</span><span class="pln">
</span><span class="pun">...</span><span class="pln">
</span><span class="kwd">return</span><span class="pln"> result</span><span class="pun">;</span><span class="pln">
</span><span class="pun">}</span><span class="pln">
</span><span class="lit">@end</span><span class="pln">
</span><span class="lit">@block</span><span class="pln"> </span><span class="typ">VSLighting</span><span class="pln">
</span><span class="com">// lighting functions for the vertex shader</span><span class="pln">
</span><span class="lit">@use</span><span class="pln"> </span><span class="typ">Util</span><span class="pln">
vec4 vsBlub</span><span class="pun">()</span><span class="pln"> </span><span class="pun">{</span><span class="pln">
</span><span class="kwd">return</span><span class="pln"> bla</span><span class="pun">();</span><span class="pln">
</span><span class="pun">}</span><span class="pln">
</span><span class="lit">@end</span><span class="pln">
</span><span class="lit">@block</span><span class="pln"> </span><span class="typ">FSLighting</span><span class="pln">
</span><span class="com">// lighting functions for the fragment shader</span><span class="pln">
</span><span class="lit">@use</span><span class="pln"> </span><span class="typ">Util</span><span class="pln">
vec4 fsBlub</span><span class="pun">()</span><span class="pln"> </span><span class="pun">{</span><span class="pln">
</span><span class="kwd">return</span><span class="pln"> bla</span><span class="pun">();</span><span class="pln">
</span><span class="pun">}</span><span class="pln">
</span><span class="lit">@end</span></code></pre>
<h4 id="glsl-code-generation-and-validation">GLSL Code Generation and Validation</h4>
<p>From the ‘tagged shader source’, the shader generator script will create actual vertex- and fragment-shader code for different GLSL versions and feed it to the reference compiler for validation.</p>
<p>For instance, the above simple vertex/fragment-shader source would produce the following GLSL 1.00 source code (for OpenGLES2 and WebGL):</p>
<pre class="prettyprint prettyprinted"><code><span class="pln">uniform mat4 mvp</span><span class="pun">;</span><span class="pln">
attribute vec4 position</span><span class="pun">;</span><span class="pln">
attribute vec2 texcoord0</span><span class="pun">;</span><span class="pln">
varying vec2 uv</span><span class="pun">;</span><span class="pln">
</span><span class="kwd">void</span><span class="pln"> main</span><span class="pun">()</span><span class="pln"> </span><span class="pun">{</span><span class="pln">
gl_Position </span><span class="pun">=</span><span class="pln"> mvp </span><span class="pun">*</span><span class="pln"> position</span><span class="pun">;</span><span class="pln">
uv </span><span class="pun">=</span><span class="pln"> texcoord0</span><span class="pun">;</span><span class="pln">
</span><span class="pun">}</span></code></pre>
<p>The output for a more modern GLSL version would look slightly different:</p>
<pre class="prettyprint prettyprinted"><code><span class="com">#version 150</span><span class="pln">
uniform mat4 mvp</span><span class="pun">;</span><span class="pln">
</span><span class="kwd">in</span><span class="pln"> vec4 position</span><span class="pun">;</span><span class="pln">
</span><span class="kwd">in</span><span class="pln"> vec2 texcoord0</span><span class="pun">;</span><span class="pln">
</span><span class="kwd">out</span><span class="pln"> vec2 uv</span><span class="pun">;</span><span class="pln">
</span><span class="kwd">void</span><span class="pln"> main</span><span class="pun">()</span><span class="pln"> </span><span class="pun">{</span><span class="pln">
gl_Position </span><span class="pun">=</span><span class="pln"> mvp </span><span class="pun">*</span><span class="pln"> position</span><span class="pun">;</span><span class="pln">
uv </span><span class="pun">=</span><span class="pln"> texcoord0</span><span class="pun">;</span><span class="pln">
</span><span class="pun">}</span></code></pre>
<p>The GLSL reference compiler is called once per GLSL version and vertex-/fragment-shader and the resulting output is captured into a string variable. The python code to start an exe and capture its output looks like this:</p>
<pre class="prettyprint prettyprinted"><code><span class="pln">child </span><span class="pun">=</span><span class="pln"> subprocess</span><span class="pun">.</span><span class="typ">Popen</span><span class="pun">([</span><span class="pln">exePath</span><span class="pun">,</span><span class="pln"> glslPath</span><span class="pun">],</span><span class="pln"> stdout</span><span class="pun">=</span><span class="pln">subprocess</span><span class="pun">.</span><span class="pln">PIPE</span><span class="pun">)</span><span class="pln">
</span><span class="kwd">out</span><span class="pln"> </span><span class="pun">=</span><span class="pln"> </span><span class="str">''</span><span class="pln">
</span><span class="kwd">while</span><span class="pln"> </span><span class="kwd">True</span><span class="pln"> </span><span class="pun">:</span><span class="pln">
</span><span class="kwd">out</span><span class="pln"> </span><span class="pun">+=</span><span class="pln"> child</span><span class="pun">.</span><span class="pln">stdout</span><span class="pun">.</span><span class="pln">read</span><span class="pun">()</span><span class="pln">
</span><span class="kwd">if</span><span class="pln"> child</span><span class="pun">.</span><span class="pln">poll</span><span class="pun">()</span><span class="pln"> </span><span class="pun">!=</span><span class="pln"> </span><span class="kwd">None</span><span class="pln"> </span><span class="pun">:</span><span class="pln">
</span><span class="kwd">break</span><span class="pln">
</span><span class="kwd">return</span><span class="pln"> </span><span class="kwd">out</span></code></pre>
<p>The output is then parsed for error messages and error line numbers. Since these line numbers point into the generated source code they are not useful by themselves, and must be mapped back to the original source-file-path and line numbers. This is why the line-parser had to store this information with each extracted source code line.</p>
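<p>Assuming the reference compiler emits errors in the form ‘ERROR: 0:12: …’ (the glslangValidator-style format; an assumption here, as are all the names), the mapping step could look roughly like this:</p>

```python
import re

# Hypothetical sketch: pull the line number out of a reference-compiler
# error message and map it back through the per-line bookkeeping the
# parser stored earlier. The error-message format is an assumption.
ERROR_RE = re.compile(r"ERROR:\s*\d+:(\d+):\s*(.*)")

def map_error(line_map, error_line):
    """line_map: list of (src_path, src_line) per generated GLSL line."""
    m = ERROR_RE.match(error_line)
    if m is None:
        return None                          # not an error line
    gen_line, msg = int(m.group(1)), m.group(2)
    src_path, src_line = line_map[gen_line - 1]   # compiler lines are 1-based
    return src_path, src_line, msg
```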
<p>The mapped source-file-path, line-number and error message must then be formatted into the gcc/clang- or VStudio-error-message format, and if an error occurs, the python script will terminate with an error code so that the build is stopped:</p>
<pre class="prettyprint prettyprinted"><code><span class="kwd">if</span><span class="pln"> platform</span><span class="pun">.</span><span class="pln">system</span><span class="pun">()</span><span class="pln"> </span><span class="pun">==</span><span class="pln"> </span><span class="str">'Windows'</span><span class="pln"> </span><span class="pun">:</span><span class="pln">
</span><span class="kwd">print</span><span class="pln"> </span><span class="str">'{}({}): error: {}'</span><span class="pun">.</span><span class="pln">format</span><span class="pun">(</span><span class="typ">FilePath</span><span class="pun">,</span><span class="pln"> </span><span class="typ">LineNumber</span><span class="pln"> </span><span class="pun">+</span><span class="pln"> </span><span class="lit">1</span><span class="pun">,</span><span class="pln"> msg</span><span class="pun">)</span><span class="pln">
</span><span class="kwd">else</span><span class="pln"> </span><span class="pun">:</span><span class="pln">
</span><span class="kwd">print</span><span class="pln"> </span><span class="str">'{}:{}: error: {}\n'</span><span class="pun">.</span><span class="pln">format</span><span class="pun">(</span><span class="typ">FilePath</span><span class="pun">,</span><span class="pln"> </span><span class="typ">LineNumber</span><span class="pln"> </span><span class="pun">+</span><span class="pln"> </span><span class="lit">1</span><span class="pun">,</span><span class="pln"> msg</span><span class="pun">)</span><span class="pln">
</span><span class="kwd">if</span><span class="pln"> terminate</span><span class="pun">:</span><span class="pln">
sys</span><span class="pun">.</span><span class="kwd">exit</span><span class="pun">(</span><span class="lit">10</span><span class="pun">)</span></code></pre>
<p>This formatting works for Xcode and VisualStudio. The error is displayed by the IDE and can be double-clicked to position the text cursor over the right source code location. It doesn’t work in Qt Creator yet unfortunately, and I haven’t tested Eclipse yet.</p>
<p>Another thing to keep in mind is that build jobs can run in parallel. At first I was writing the intermediate GLSL files for the reference compiler into files with simple filenames (like ‘vs.vert’ and ‘fs.frag’). This didn’t cause any problems when doing trivial tests, but once I had converted all Oryol samples to use the shader generator I was sometimes getting weird errors from the reference compiler which didn’t make any sense at first.</p>
<p>The problem was that build jobs were running at the same time and overwrote each other’s intermediate files. The solution was to use randomized filenames which cannot collide. As always, python has a module for exactly this case, called ‘tempfile’:</p>
<pre class="prettyprint prettyprinted"><code><span class="com"># this writes to a new temp vertex shader file </span><span class="pln">
f </span><span class="pun">=</span><span class="pln"> tempfile</span><span class="pun">.</span><span class="typ">NamedTemporaryFile</span><span class="pun">(</span><span class="pln">suffix</span><span class="pun">=</span><span class="str">'.vert'</span><span class="pun">,</span><span class="pln"> </span><span class="kwd">delete</span><span class="pun">=</span><span class="kwd">False</span><span class="pun">)</span><span class="pln">
writeFile</span><span class="pun">(</span><span class="pln">f</span><span class="pun">,</span><span class="pln"> lines</span><span class="pun">)</span><span class="pln">
f</span><span class="pun">.</span><span class="pln">close</span><span class="pun">()</span><span class="pln">
</span><span class="com"># call the validator</span><span class="pln">
</span><span class="pun">...</span><span class="pln">
</span><span class="com"># delete the temp file when done</span><span class="pln">
os</span><span class="pun">.</span><span class="pln">unlink</span><span class="pun">(</span><span class="pln">f</span><span class="pun">.</span><span class="pln">name</span><span class="pun">)</span></code></pre>
<h4 id="the-c-side">The C++ Side</h4>
<p>Last but not least a quick look at the generated C++ source code. The C++ header defines a namespace with the name of the shader-library, and one class per shader-bundle. The very simple vertex/fragment-shader sample from above would generate a header like this:</p>
<pre class="prettyprint prettyprinted"><code><span class="com">#pragma</span><span class="pln"> once
</span><span class="com">/* #version:1#
machine generated, do not edit!
*/</span><span class="pln">
</span><span class="com">#include</span><span class="pln"> </span><span class="str">"Render/Setup/ProgramBundleSetup.h"</span><span class="pln">
</span><span class="kwd">namespace</span><span class="pln"> </span><span class="typ">Oryol</span><span class="pln"> </span><span class="pun">{</span><span class="pln">
</span><span class="kwd">namespace</span><span class="pln"> </span><span class="typ">Shaders</span><span class="pln"> </span><span class="pun">{</span><span class="pln">
</span><span class="kwd">class</span><span class="pln"> </span><span class="typ">Main</span><span class="pln"> </span><span class="pun">{</span><span class="pln">
</span><span class="kwd">public</span><span class="pun">:</span><span class="pln">
</span><span class="kwd">static</span><span class="pln"> </span><span class="kwd">const</span><span class="pln"> int32 </span><span class="typ">ModelViewProj</span><span class="pln"> </span><span class="pun">=</span><span class="pln"> </span><span class="lit">0</span><span class="pun">;</span><span class="pln">
</span><span class="kwd">static</span><span class="pln"> </span><span class="kwd">const</span><span class="pln"> int32 </span><span class="typ">Texture</span><span class="pln"> </span><span class="pun">=</span><span class="pln"> </span><span class="lit">1</span><span class="pun">;</span><span class="pln">
</span><span class="kwd">static</span><span class="pln"> </span><span class="typ">Render</span><span class="pun">::</span><span class="typ">ProgramBundleSetup</span><span class="pln"> </span><span class="typ">CreateSetup</span><span class="pun">();</span><span class="pln">
</span><span class="pun">};</span><span class="pln">
</span><span class="pun">}</span><span class="pln">
</span><span class="pun">}</span></code></pre>
<p>Note the ModelViewProj and Texture constant definitions. These are used to set the uniform values in the C++ render loop.</p>
<p>How this code is actually used for rendering is a topic of its own. For now let me just point to the Oryol sample source code:</p>
<p><a href="https://github.com/floooh/oryol/tree/master/code/Samples/Render">https://github.com/floooh/oryol/tree/master/code/Samples/Render</a></p>
<h4 id="whats-next">What’s next</h4>
<p>The existing shader tags are already quite useful, but only the beginning. The real problem I want to solve is managing slightly differing variations of the same shader. For instance there might exist a specific high-level material which must be applied to static and skinned geometry (2 variations), can cast shadows (2 more variations: a static and a skinned shadow caster), and should be available in a forward-renderer and deferred-renderer (== many more slightly different shader variations). Sometimes an ueber-shader approach is better, and sometimes genuinely separate shaders for each variation are better. </p>
<p>The guts of those material shaders are always built from the same small code fragments, just arranged and combined differently.</p>
<p>Hopefully a couple of new ‘@’ and ‘$’ tags will be enough, but what this will look like in detail I don’t know yet. One inspiration is web-template engines, which build web pages from a set of templates and rules. Another inspiration is the existing connect-the-dots shader editors (even though I want to keep the focus on ‘shaders-as-source-code’, not ‘shaders-as-data’, some limited runtime code generation would still make sense).</p>
<p>And of course the right middle-ground between ‘modern GLSL’ and ‘legacy GLSL’ must be found. Unfortunately OpenGL ES2 / WebGL1.0 will have to be the foundation for quite some time.</p>
<p>And that’s all for today :)</p>
<blockquote>
<p>Written with <a href="https://stackedit.io/">StackEdit</a>.</p>
</blockquote>Unknownnoreply@blogger.comtag:blogger.com,1999:blog-2948438400037317662.post-67089197447588057442014-04-20T19:51:00.001+01:002014-04-20T22:28:13.965+01:00cmake and the Android NDK<p>TL;DR: how to build Android NDK applications with cmake instead of the custom NDK build system, this is useful for projects which already use cmake to create multiplatform/cross-compiling build files.</p>
<p><strong>Update:</strong> Thanks to <a href="http://thp.io">thp</a> for pointing out a rather serious bug: packaging the standard shared libraries into the APK should NOT be necessary, since these are pre-installed on the device. I noticed that I didn’t set a library search path to the toolchain lib dir in the linker step (-L…), which might explain the crash I had earlier, but unfortunately I can’t reproduce this crash anymore with the old behaviour (no library search path and no shared system libraries in the APK). I’ll keep an eye on that and update the blog post with my findings.</p>
<hr>
<p>I’ve spent the last 2.5 days adding Android support to Oryol’s build system. This wasn’t exactly on my to-do list until I sorta “impulse-bought” a Nexus7 tablet last Thursday. It basically went like this “hey that looks quite neat for a non-iPad tablet => wow, scrolling feels smooth, very non-Android-like => holy shit it runs my Oryol WebGL samples at 60fps => hmm 179 Euros seems quite reasonable…” - I must say I’m impressed how far the Android “user experience” has come since I last dabbled with it. The UI finally feels completely smooth, and I didn’t have any of those Windows8-Metro-style WTF-moments yet.</p>
<p>Ok, so the logical next step would be to add support for Android to the Oryol build system (if you don’t know what Oryol is: it’s a new experimental C++11 multi-plat engine I started a couple months ago: <a href="https://github.com/floooh/oryol">https://github.com/floooh/oryol</a>).</p>
<p>The Oryol build system is cmake-based, with a python script on top which simplifies managing the dozens of possible build-configs. A build-config is one specific combination of target-platform (osx, ios, win32, win64, …), build-tools (make, ninja, Visual Studio, Xcode, …) and compile-mode (Release, Debug) stored under a descriptive name (e.g. osx-xcode-debug, win32-vstudio-release, emscripten-make-debug, …).</p>
<p>The front-end python script called ‘oryol’ is used to juggle all the build-configs, invoke cmake with the right options, and perform command line builds.</p>
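<p>A build-config name is just its three parts joined with dashes, so splitting one apart is trivial; a hypothetical sketch (the real script’s internals may differ):</p>

```python
# Hypothetical sketch of splitting a build-config name into its parts;
# the actual 'oryol' script's internals may look different.
def parse_config_name(config):
    """'osx-xcode-debug' -> ('osx', 'xcode', 'debug')"""
    platform, build_tool, mode = config.split('-')
    return platform, build_tool, mode
```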
<p>One can for instance simply call:</p>
<pre class="prettyprint prettyprinted"><code><span class="pun">></span><span class="pln"> </span><span class="pun">./</span><span class="pln">oryol update osx</span><span class="pun">-</span><span class="pln">xcode</span><span class="pun">-</span><span class="pln">debug</span></code></pre>
<p>…to generate an Xcode project.</p>
<p>Or to perform a command line build with xcodebuild instead:</p>
<pre class="prettyprint prettyprinted"><code><span class="pun">></span><span class="pln"> </span><span class="pun">./</span><span class="pln">oryol build osx</span><span class="pun">-</span><span class="pln">xcode</span><span class="pun">-</span><span class="pln">debug</span></code></pre>
<p>Or to build Oryol for emscripten with make in Release mode (provided the emscripten SDK has been installed):</p>
<pre class="prettyprint prettyprinted"><code><span class="pun">></span><span class="pln"> </span><span class="pun">./</span><span class="pln">oryol build emscripten</span><span class="pun">-</span><span class="pln">make</span><span class="pun">-</span><span class="pln">release</span></code></pre>
<p>This also works on Windows (32- or 64-bit): </p>
<pre class="prettyprint prettyprinted"><code><span class="pun">></span><span class="pln"> oryol build win64</span><span class="pun">-</span><span class="pln">vstudio</span><span class="pun">-</span><span class="pln">debug
</span><span class="pun">></span><span class="pln"> oryol build win32</span><span class="pun">-</span><span class="pln">vstudio</span><span class="pun">-</span><span class="pln">debug</span></code></pre>
<p>…or on Linux:</p>
<pre class="prettyprint prettyprinted"><code><span class="pun">></span><span class="pln"> </span><span class="pun">./</span><span class="pln">oryol build linux</span><span class="pun">-</span><span class="pln">make</span><span class="pun">-</span><span class="pln">debug</span></code></pre>
<p>Now, what I want to do with my shiny new Nexus7 is of course this:</p>
<pre class="prettyprint prettyprinted"><code><span class="pun">></span><span class="pln"> </span><span class="pun">./</span><span class="pln">oryol build android</span><span class="pun">-</span><span class="pln">make</span><span class="pun">-</span><span class="pln">debug</span></code></pre>
<p>This turned out to be harder than usual. But let’s start at the beginning:</p>
<p>A cross-compiling scenario is normally well defined in the GCC/cmake world:</p>
<p>A <strong>toolchain</strong> wraps the target-platform’s compiler tools, system headers and libs under a standardized directory structure:</p>
<p>The compiler tools usually reside in a <strong>bin</strong> subdirectory and are called <strong>gcc</strong> and <strong>g++</strong> (or in the LLVM world: <strong>clang</strong> and <strong>clang++</strong>). Sometimes the tools have a prefix (<strong>pnacl-clang</strong> and <strong>pnacl-clang++</strong>), or completely different names (like <strong>emcc</strong> in the emscripten SDK).</p>
<p>Headers and libs are often located in a <strong>usr</strong> directory (<strong>usr/include</strong> and <strong>usr/lib</strong>).</p>
<p>The toolchain headers contain at least the C-Runtime headers, like <strong>stdlib.h</strong> and <strong>stdio.h</strong>, usually the C++ headers (<strong>vector</strong>, <strong>iostream</strong>, …), and often also the OpenGL headers and other platform-specific header files.</p>
<p>Finally the lib directory contains precompiled system libraries for the target platform (for instance <strong>libc.a</strong>, <strong>libc++.a</strong>, etc…).</p>
<p>With such a standard gcc-style toolchain, cross-compilation is very simple. Just make sure that the toolchain-compiler tools are called instead of the host platform’s tools, and that the toolchain headers and libs are used.</p>
<p>cmake standardizes this process with its so-called <strong>toolchain-files</strong>. A toolchain-file defines what compiler tools, headers and libraries should be used instead of the ‘default’ ones, and usually also overrides compile and linker flags.</p>
<p>The typical strategy when adding a new target platform to a cmake build system looks like this:</p>
<ul>
<li>set up the target platform’s SDK</li>
<li>create a new toolchain file (obviously)</li>
<li>tell cmake where to find the compiler tools, header and libs</li>
<li>add the right compile and linker flags</li>
</ul>
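<p>For illustration, the core of such a toolchain file could look like the following sketch (all paths are placeholders, not an actual SDK layout):</p>
<pre class="prettyprint"><code># hypothetical cross-compile toolchain file (all paths are placeholders)
set(CMAKE_SYSTEM_NAME Linux)

# use the toolchain's compilers instead of the host compilers
set(CMAKE_C_COMPILER "/path/to/toolchain/bin/gcc")
set(CMAKE_CXX_COMPILER "/path/to/toolchain/bin/g++")

# search headers and libs only inside the toolchain, never on the host
set(CMAKE_FIND_ROOT_PATH "/path/to/toolchain/usr")
set(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)
set(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
set(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)</code></pre>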
<p>Once the toolchain file has been created, call cmake with the toolchain file:</p>
<pre class="prettyprint prettyprinted"><code><span class="pun">></span><span class="pln"> cmake </span><span class="pun">-</span><span class="pln">G</span><span class="str">"Unix Makefiles"</span><span class="pln"> </span><span class="pun">-</span><span class="pln">DCMAKE_TOOLCHAIN_FILE</span><span class="pun">=[</span><span class="pln">path</span><span class="pun">-</span><span class="pln">to</span><span class="pun">-</span><span class="pln">toolchain</span><span class="pun">-</span><span class="pln">file</span><span class="pun">]</span><span class="pln"> </span><span class="pun">[</span><span class="pln">path</span><span class="pun">-</span><span class="pln">to</span><span class="pun">-</span><span class="pln">project</span><span class="pun">]</span></code></pre>
<p>Then run make in verbose mode to check whether the right compiler is called, and with the right options:</p>
<pre class="prettyprint prettyprinted"><code><span class="pun">></span><span class="pln"> make VERBOSE</span><span class="pun">=</span><span class="lit">1</span></code></pre>
<p>This approach works well for platforms like emscripten or Google Native Client. Some platforms require a bit of additional cmake-magic, a Portable Native Client executable for instance must be “finalized” after it has been linked. Additional build steps like these can be added easily in cmake with the <strong>add_custom_command</strong> macro.</p>
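<p>Such a finalize step boils down to a single post-build command, roughly like this (a sketch; <strong>PNACL_TOOLCHAIN_BIN</strong> and the output path are made-up placeholders):</p>
<pre class="prettyprint"><code># sketch: 'finalize' a linked PNaCl executable as a post-build step;
# PNACL_TOOLCHAIN_BIN is a made-up variable pointing into the NaCl SDK
add_custom_command(TARGET ${target} POST_BUILD
    COMMAND ${PNACL_TOOLCHAIN_BIN}/pnacl-finalize ${CMAKE_CURRENT_BINARY_DIR}/${target}.pexe
    COMMENT "Finalizing PNaCl executable")</code></pre>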
<p>Integrating Android as a new target platform isn’t so easy though:</p>
<ul>
<li>the Android SDK itself only allows creating pure Java applications; for C/C++ apps, the separate Android NDK (Native Development Kit) is required</li>
<li>the NDK doesn’t produce complete Android applications, it needs the Android Java SDK for this</li>
<li>native Android code isn’t a typical executable, but lives in a shared library which is called from Java through JNI</li>
<li>the Android SDK and NDK both have their own build systems which hide a lot of complexity</li>
<li>…this complexity comes from the combination of different host platforms (OSX, Linux, Windows), target API levels (android-3 to android-19, roughly corresponding to Android versions), compiler versions (gcc4.6, gcc4.9, clang3.3, clang3.4), and finally CPU architectures and instruction sets (ARM, MIPS, X86, with several variations for ARM: armv5, armv7, with or without NEON, etc…)</li>
<li>C++ support is still bolted on, the C++ headers and libs are not in their standard locations</li>
<li>the NDK doesn’t follow the standard GCC toolchain directory structure at all</li>
</ul>
<p>The custom build system coming with the NDK does a good job of hiding all this complexity, for instance it can automatically build for all CPU architectures, but it stops after the native shared library has been compiled: it cannot create a complete Android APK. For this, the Android Java SDK tools must be called from the command line.</p>
<p>So back to how to make this work in cmake:</p>
<p>The plan looks simple enough:</p>
<ol>
<li>compile our C/C++ code into a shared library instead of an executable</li>
<li>somehow get this into a Java APK package file…</li>
<li>…deploy APK to Android device and run it</li>
</ol>
<p>Step 1 starts rather innocently: create a toolchain file, look up the paths to the compiler tools, headers and libs in the NDK, then look up the compiler and linker command line args by watching a verbose build. Then put all this stuff into the right cmake variables. At least this is how it usually works. Of course for Android it’s all a bit more complicated:</p>
<ul>
<li>first we need to decide on a target CPU architecture and what compiler to use. I settled for ARM and gcc4.8, which leads us to <strong>[…]/android-ndk-r9d/toolchains/arm-linux-androideabi-4.8/prebuilt</strong></li>
<li>in there is a directory <strong>darwin-x86_64</strong> so we need separate paths by host platform here</li>
<li>finally in there is a bin directory with the compiler tools, so GCC would be for instance at <strong>[..]/android-ndk-r9d/toolchains/arm-linux-androideabi-4.8/prebuilt/darwin-x86_64/bin/arm-linux-androideabi-gcc</strong></li>
<li>there’s also an include, lib and share directory but the stuff in there definitely doesn’t look like system headers and libs… bummer.</li>
<li>the system headers and libs are under the platforms directory instead: <strong>[..]/android-ndk-r9d/platforms/android-19/arch-arm/usr/include</strong>, and <strong>[..]/android-ndk-r9d/platforms/android-19/arch-arm/usr/lib</strong></li>
<li>so far so good… put this stuff into the toolchain file and it seems to compile fine, until the first C++ header must be included. WTF?</li>
<li>on closer inspection, the system include directory doesn’t contain any C++ headers, and there are different C++ library implementations to choose from under <strong>[..]/android-ndk-r9d/sources/cxx-stl</strong></li>
</ul>
<p>This was the point where I was seriously thinking about calling it a day, until I stumbled across the <strong>make-standalone-toolchain.sh</strong> script in build/tools. This is a helper script which builds a standard GCC-style toolchain for one specific Android API-level and target CPU:</p>
<pre class="prettyprint"><code>sh make-standalone-toolchain.sh \
    --platform=android-19 \
    --ndk-dir=/Users/[user]/android-ndk-r9d \
    --install-dir=/Users/[user]/android-toolchain \
    --toolchain=arm-linux-androideabi-4.8 \
    --system=darwin-x86_64</code></pre>
<p>This will extract the right tools, headers and libs, and also integrate the C++ headers (by default gnustl, but this can be selected with the --stl option). When the script is done, a new directory ‘android-toolchain’ has been created which follows the GCC toolchain standard and is much easier to integrate with cmake:</p>
<p>The important directories are: <br>
- <strong>[..]/android-toolchain/bin</strong>: this is where the compiler tools are located; these are still prefixed (e.g. <strong>arm-linux-androideabi-gcc</strong>) <br>
- <strong>[..]/android-toolchain/sysroot/usr/include</strong>: CRT headers, plus EGL, GLES2, etc…, but NOT the C++ headers <br>
- <strong>[..]/android-toolchain/include</strong>: the C++ headers are here, under ‘c++’ <br>
- <strong>[..]/android-toolchain/sysroot/usr/lib</strong>: .a and .so system libs; libstdc++.a/.so is also here, no idea why</p>
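<p>In the toolchain file, these directories map to something like the following fragment (a sketch; <strong>ANDROID_TOOLCHAIN_ROOT</strong> is a made-up variable pointing at the generated ‘android-toolchain’ directory):</p>
<pre class="prettyprint"><code># sketch: toolchain file entries for the generated standalone toolchain;
# ANDROID_TOOLCHAIN_ROOT is a made-up variable, set it to the install-dir
set(CMAKE_SYSTEM_NAME Linux)
set(CMAKE_C_COMPILER "${ANDROID_TOOLCHAIN_ROOT}/bin/arm-linux-androideabi-gcc")
set(CMAKE_CXX_COMPILER "${ANDROID_TOOLCHAIN_ROOT}/bin/arm-linux-androideabi-g++")
# system headers plus the C++ headers (which live under include/c++,
# in a gcc-version subdirectory)
include_directories(
    "${ANDROID_TOOLCHAIN_ROOT}/sysroot/usr/include"
    "${ANDROID_TOOLCHAIN_ROOT}/include/c++")
link_directories("${ANDROID_TOOLCHAIN_ROOT}/sysroot/usr/lib")</code></pre>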
<p>After setting these paths in the toolchain file and telling cmake to create shared libs instead of exes when building for the Android platform, the compile and link steps worked: instead of a CoreHello executable, I got a libCoreHello.so. So far so good.</p>
<p>The next step was to figure out how to get this .so into an APK which can be uploaded to an Android device.</p>
<p>The NDK doesn’t help with this, so this is where we need the Java SDK tools, which use yet another build system: ant. From looking at the SDK samples I figured out that it is usually enough to call <strong>ant debug</strong> or <strong>ant release</strong> within a sample directory to build an .apk file into a bin subdirectory. ant requires a <strong>build.xml</strong> file which defines the build tasks to perform. Furthermore, Android apps have an embedded AndroidManifest.xml file which describes how to run the application and what privileges it requires. None of these exist in the NDK sample directories though…</p>
<p>After some more exploration it became clear: The SDK has a helper script called <strong>android</strong> which is used (among many other things) to set up a project directory structure with all required files for ant to create a working APK:</p>
<pre class="prettyprint prettyprinted"><code><span class="pun">></span><span class="pln"> android create project
</span><span class="pun">--</span><span class="pln">path </span><span class="typ">MyApp</span><span class="pln">
</span><span class="pun">--</span><span class="pln">target android</span><span class="pun">-</span><span class="lit">19</span><span class="pln">
</span><span class="pun">--</span><span class="pln">name </span><span class="typ">MyApp</span><span class="pln">
</span><span class="pun">--</span><span class="kwd">package</span><span class="pln"> com</span><span class="pun">.</span><span class="pln">oryol</span><span class="pun">.</span><span class="typ">MyApp</span><span class="pln">
</span><span class="pun">--</span><span class="pln">activity </span><span class="typ">MyActivity</span></code></pre>
<p>This will set up a directory ‘MyApp’ with a complete Android Java skeleton app. Run ‘ant debug’ in there and it will create a ‘MyApp-debug.apk’ in the ‘bin’ subdirectory, which can be deployed to the Android device with ‘adb install MyApp-debug.apk’; when executed it displays a ‘Hello World, MyActivity’ string.</p>
<p>Easy enough, but there are 2 problems. <strong>First</strong>: how to get our native shared library packaged and called? And <strong>second</strong>: the Java SDK project directory hierarchy doesn’t really fit well into the source tree of a C/C++ project; there should be a directory per sample app with a couple of C++ files and a CMakeLists.txt file and nothing more.</p>
<p>The first problem is simple to solve: the project directory hierarchy contains a libs directory, and all .so files in there will be copied into the APK by ant (to verify this: an .apk is actually a zip file, so simply change the file extension to zip and peek into the file). One important point: the libs directory contains one sub-directory level per CPU architecture, so once we start to support multiple CPU instruction sets we need to put them into subdirectories like this:</p>
<pre class="prettyprint prettyprinted"><code><span class="typ">FlohOfWoe</span><span class="pun">:</span><span class="pln">libs floh$ ls
armeabi armeabi</span><span class="pun">-</span><span class="pln">v7a mips x86</span></code></pre>
<p>Since my cmake build-system currently only supports building for armeabi-v7a I’ve put my .so file in the armeabi-v7a subdirectory.</p>
<p>Now I thought that I had everything in place: I had an APK file with my native code .so lib in it, I used the NativeActivity and android_native_app_glue.h approach, and logged a “Hello World” to the system log (which can be inspected with <strong>adb logcat</strong> from the host system).</p>
<p>And still the app didn’t start; instead this showed up in the log:</p>
<pre class="prettyprint prettyprinted"><code><span class="pln">D</span><span class="pun">/</span><span class="typ">AndroidRuntime</span><span class="pun">(</span><span class="pln"> </span><span class="lit">482</span><span class="pun">):</span><span class="pln"> </span><span class="typ">Shutting</span><span class="pln"> down VM
W</span><span class="pun">/</span><span class="pln">dalvikvm</span><span class="pun">(</span><span class="pln"> </span><span class="lit">482</span><span class="pun">):</span><span class="pln"> threadid</span><span class="pun">=</span><span class="lit">1</span><span class="pun">:</span><span class="pln"> thread exiting </span><span class="kwd">with</span><span class="pln"> uncaught exception </span><span class="pun">(</span><span class="kwd">group</span><span class="pun">=</span><span class="lit">0x41597ba8</span><span class="pun">)</span><span class="pln">
E</span><span class="pun">/</span><span class="typ">AndroidRuntime</span><span class="pun">(</span><span class="pln"> </span><span class="lit">482</span><span class="pun">):</span><span class="pln"> FATAL EXCEPTION</span><span class="pun">:</span><span class="pln"> main
E</span><span class="pun">/</span><span class="typ">AndroidRuntime</span><span class="pun">(</span><span class="pln"> </span><span class="lit">482</span><span class="pun">):</span><span class="pln"> </span><span class="typ">Process</span><span class="pun">:</span><span class="pln"> com</span><span class="pun">.</span><span class="pln">oryol</span><span class="pun">.</span><span class="typ">CoreHello</span><span class="pun">,</span><span class="pln"> PID</span><span class="pun">:</span><span class="pln"> </span><span class="lit">482</span><span class="pln">
E</span><span class="pun">/</span><span class="typ">AndroidRuntime</span><span class="pun">(</span><span class="pln"> </span><span class="lit">482</span><span class="pun">):</span><span class="pln"> java</span><span class="pun">.</span><span class="pln">lang</span><span class="pun">.</span><span class="typ">RuntimeException</span><span class="pun">:</span><span class="pln"> </span><span class="typ">Unable</span><span class="pln"> to start activity </span><span class="typ">ComponentInfo</span><span class="pun">{</span><span class="pln">com</span><span class="pun">.</span><span class="pln">oryol</span><span class="pun">.</span><span class="typ">CoreHello</span><span class="pun">/</span><span class="pln">android</span><span class="pun">.</span><span class="pln">app</span><span class="pun">.</span><span class="typ">NativeActivity</span><span class="pun">}:</span><span class="pln"> java</span><span class="pun">.</span><span class="pln">lang</span><span class="pun">.</span><span class="typ">IllegalArgumentException</span><span class="pun">:</span><span class="pln"> </span><span class="typ">Unable</span><span class="pln"> to load </span><span class="kwd">native</span><span class="pln"> library</span><span class="pun">:</span><span class="pln"> </span><span class="str">/data/</span><span class="pln">app</span><span class="pun">-</span><span class="pln">lib</span><span class="pun">/</span><span class="pln">com</span><span class="pun">.</span><span class="pln">oryol</span><span class="pun">.</span><span class="typ">CoreHello</span><span class="pun">-</span><span class="lit">1</span><span class="pun">/</span><span class="pln">libCoreHello</span><span class="pun">.</span><span class="pln">so
E</span><span class="pun">/</span><span class="typ">AndroidRuntime</span><span class="pun">(</span><span class="pln"> </span><span class="lit">482</span><span class="pun">):</span><span class="pln"> at android</span><span class="pun">.</span><span class="pln">app</span><span class="pun">.</span><span class="typ">ActivityThread</span><span class="pun">.</span><span class="pln">performLaunchActivity</span><span class="pun">(</span><span class="typ">ActivityThread</span><span class="pun">.</span><span class="pln">java</span><span class="pun">:</span><span class="lit">2195</span><span class="pun">)</span></code></pre>
<p>This was the second time I banged my head against the wall for a while, until I started to look into how linker dependencies are resolved for the shared library. I was pretty sure that I gave all the required libs on the linker command line (-lc -llog -landroid, etc); my error was assuming that these would be linked statically. Instead, linking against system libraries is dynamic by default. The ndk-depends tool helps in finding the dependencies:</p>
<pre class="prettyprint prettyprinted"><code><span class="pln">localhost</span><span class="pun">:</span><span class="pln">armeabi</span><span class="pun">-</span><span class="pln">v7a floh$ </span><span class="pun">~</span><span class="str">/android-ndk-r9d/</span><span class="pln">ndk</span><span class="pun">-</span><span class="pln">depends libCoreHello</span><span class="pun">.</span><span class="pln">so
libCoreHello</span><span class="pun">.</span><span class="pln">so
libm</span><span class="pun">.</span><span class="pln">so
liblog</span><span class="pun">.</span><span class="pln">so
libdl</span><span class="pun">.</span><span class="pln">so
libc</span><span class="pun">.</span><span class="pln">so
libandroid</span><span class="pun">.</span><span class="pln">so
libGLESv2</span><span class="pun">.</span><span class="pln">so
libEGL</span><span class="pun">.</span><span class="pln">so</span></code></pre>
<p><del>This is basically the list of .so files which must be contained in the APK. After I copied these to the SDK project's lib directory, together with my libCoreHello.so</del>. <strong>Update:</strong> These shared libs are not supposed to be packaged into the APK! Instead the standard system shared libraries which already exist on the device should be linked at startup. </p>
<p>I finally saw the sweet, sweet ‘Hello World!’ showing up in the adb log!</p>
<p>But I skipped one important part: so far I fixed everything manually, but of course I want automated Android batch builds, without having those ugly Android skeleton project files in the git repository.</p>
<p>To solve this I did a bit of cmake-fu:</p>
<p>Instead of having the Android SDK project files committed into version control, I’m treating these as temporary build files.</p>
<p>When cmake runs for an Android build target, it does the following additional steps:</p>
<p>For each application target, a temporary Android SDK project is created in the build directory (basically the ‘android create project’ call described above):</p>
<pre class="prettyprint prettyprinted"><code><span class="com"># call the android SDK tool to create a new project</span><span class="pln">
execute_process</span><span class="pun">(</span><span class="pln">COMMAND $</span><span class="pun">{</span><span class="pln">ANDROID_SDK_TOOL</span><span class="pun">}</span><span class="pln"> create project
</span><span class="pun">--</span><span class="pln">path $</span><span class="pun">{</span><span class="pln">CMAKE_CURRENT_BINARY_DIR</span><span class="pun">}/</span><span class="pln">android
</span><span class="pun">--</span><span class="pln">target $</span><span class="pun">{</span><span class="pln">ANDROID_PLATFORM</span><span class="pun">}</span><span class="pln">
</span><span class="pun">--</span><span class="pln">name $</span><span class="pun">{</span><span class="pln">target</span><span class="pun">}</span><span class="pln">
</span><span class="pun">--</span><span class="kwd">package</span><span class="pln"> com</span><span class="pun">.</span><span class="pln">oryol</span><span class="pun">.</span><span class="pln">$</span><span class="pun">{</span><span class="pln">target</span><span class="pun">}</span><span class="pln">
</span><span class="pun">--</span><span class="pln">activity </span><span class="typ">DummyActivity</span><span class="pln">
WORKING_DIRECTORY $</span><span class="pun">{</span><span class="pln">CMAKE_CURRENT_BINARY_DIR</span><span class="pun">})</span></code></pre>
<p>The output directory for the shared library linker step is redirected to the ‘libs’ subdirectory of this skeleton project:</p>
<pre class="prettyprint prettyprinted"><code><span class="com"># set the output directory for the .so files to point to the android project's 'libs/[cpuarch]' directory</span><span class="pln">
</span><span class="kwd">set</span><span class="pun">(</span><span class="pln">ANDROID_SO_OUTDIR $</span><span class="pun">{</span><span class="pln">CMAKE_CURRENT_BINARY_DIR</span><span class="pun">}/</span><span class="pln">android</span><span class="pun">/</span><span class="pln">libs</span><span class="pun">/</span><span class="pln">$</span><span class="pun">{</span><span class="pln">ANDROID_NDK_CPU</span><span class="pun">})</span><span class="pln">
set_target_properties</span><span class="pun">(</span><span class="pln">$</span><span class="pun">{</span><span class="pln">target</span><span class="pun">}</span><span class="pln"> PROPERTIES LIBRARY_OUTPUT_DIRECTORY $</span><span class="pun">{</span><span class="pln">ANDROID_SO_OUTDIR</span><span class="pun">})</span><span class="pln">
set_target_properties</span><span class="pun">(</span><span class="pln">$</span><span class="pun">{</span><span class="pln">target</span><span class="pun">}</span><span class="pln"> PROPERTIES LIBRARY_OUTPUT_DIRECTORY_RELEASE $</span><span class="pun">{</span><span class="pln">ANDROID_SO_OUTDIR</span><span class="pun">})</span><span class="pln">
set_target_properties</span><span class="pun">(</span><span class="pln">$</span><span class="pun">{</span><span class="pln">target</span><span class="pun">}</span><span class="pln"> PROPERTIES LIBRARY_OUTPUT_DIRECTORY_DEBUG $</span><span class="pun">{</span><span class="pln">ANDROID_SO_OUTDIR</span><span class="pun">})</span></code></pre>
<p><del>The required system shared libraries are also copied there:</del> (DON’T DO THIS, normally the system’s standard shared libraries should be used)</p>
<pre class="prettyprint prettyprinted"><code><span class="com"># copy shared libraries over from the Android toolchain directory</span><span class="pln">
</span><span class="com"># FIXME: this should be automated as post-build-step by invoking the ndk-depends command</span><span class="pln">
</span><span class="com"># to find out the .so's, and copy them over</span><span class="pln">
file</span><span class="pun">(</span><span class="pln">COPY $</span><span class="pun">{</span><span class="pln">ANDROID_SYSROOT_LIB</span><span class="pun">}/</span><span class="pln">libm</span><span class="pun">.</span><span class="pln">so DESTINATION $</span><span class="pun">{</span><span class="pln">ANDROID_SO_OUTDIR</span><span class="pun">})</span><span class="pln">
file</span><span class="pun">(</span><span class="pln">COPY $</span><span class="pun">{</span><span class="pln">ANDROID_SYSROOT_LIB</span><span class="pun">}/</span><span class="pln">liblog</span><span class="pun">.</span><span class="pln">so DESTINATION $</span><span class="pun">{</span><span class="pln">ANDROID_SO_OUTDIR</span><span class="pun">})</span><span class="pln">
file</span><span class="pun">(</span><span class="pln">COPY $</span><span class="pun">{</span><span class="pln">ANDROID_SYSROOT_LIB</span><span class="pun">}/</span><span class="pln">libdl</span><span class="pun">.</span><span class="pln">so DESTINATION $</span><span class="pun">{</span><span class="pln">ANDROID_SO_OUTDIR</span><span class="pun">})</span><span class="pln">
file</span><span class="pun">(</span><span class="pln">COPY $</span><span class="pun">{</span><span class="pln">ANDROID_SYSROOT_LIB</span><span class="pun">}/</span><span class="pln">libc</span><span class="pun">.</span><span class="pln">so DESTINATION $</span><span class="pun">{</span><span class="pln">ANDROID_SO_OUTDIR</span><span class="pun">})</span><span class="pln">
file</span><span class="pun">(</span><span class="pln">COPY $</span><span class="pun">{</span><span class="pln">ANDROID_SYSROOT_LIB</span><span class="pun">}/</span><span class="pln">libandroid</span><span class="pun">.</span><span class="pln">so DESTINATION $</span><span class="pun">{</span><span class="pln">ANDROID_SO_OUTDIR</span><span class="pun">})</span><span class="pln">
file</span><span class="pun">(</span><span class="pln">COPY $</span><span class="pun">{</span><span class="pln">ANDROID_SYSROOT_LIB</span><span class="pun">}/</span><span class="pln">libGLESv2</span><span class="pun">.</span><span class="pln">so DESTINATION $</span><span class="pun">{</span><span class="pln">ANDROID_SO_OUTDIR</span><span class="pun">})</span><span class="pln">
file</span><span class="pun">(</span><span class="pln">COPY $</span><span class="pun">{</span><span class="pln">ANDROID_SYSROOT_LIB</span><span class="pun">}/</span><span class="pln">libEGL</span><span class="pun">.</span><span class="pln">so DESTINATION $</span><span class="pun">{</span><span class="pln">ANDROID_SO_OUTDIR</span><span class="pun">})</span></code></pre>
<p>The default AndroidManifest.xml file is overwritten with a customized one:</p>
<pre class="prettyprint prettyprinted"><code><span class="com"># override AndroidManifest.xml </span><span class="pln">
file</span><span class="pun">(</span><span class="pln">WRITE $</span><span class="pun">{</span><span class="pln">CMAKE_CURRENT_BINARY_DIR</span><span class="pun">}/</span><span class="pln">android</span><span class="pun">/</span><span class="typ">AndroidManifest</span><span class="pun">.</span><span class="pln">xml
</span><span class="str">"<manifest xmlns:android=\"http://schemas.android.com/apk/res/android\"\n"</span><span class="pln">
</span><span class="str">" package=\"com.oryol.${target}\"\n"</span><span class="pln">
</span><span class="str">" android:versionCode=\"1\"\n"</span><span class="pln">
</span><span class="str">" android:versionName=\"1.0\">\n"</span><span class="pln">
</span><span class="str">" <uses-sdk android:minSdkVersion=\"11\" android:targetSdkVersion=\"19\"/>\n"</span><span class="pln">
</span><span class="str">" <uses-feature android:glEsVersion=\"0x00020000\"></uses-feature>"</span><span class="pln">
</span><span class="str">" <application android:label=\"${target}\" android:hasCode=\"false\">\n"</span><span class="pln">
</span><span class="str">" <activity android:name=\"android.app.NativeActivity\"\n"</span><span class="pln">
</span><span class="str">" android:label=\"${target}\"\n"</span><span class="pln">
</span><span class="str">" android:configChanges=\"orientation|keyboardHidden\">\n"</span><span class="pln">
</span><span class="str">" <meta-data android:name=\"android.app.lib_name\" android:value=\"${target}\"/>\n"</span><span class="pln">
</span><span class="str">" <intent-filter>\n"</span><span class="pln">
</span><span class="str">" <action android:name=\"android.intent.action.MAIN\"/>\n"</span><span class="pln">
</span><span class="str">" <category android:name=\"android.intent.category.LAUNCHER\"/>\n"</span><span class="pln">
</span><span class="str">" </intent-filter>\n"</span><span class="pln">
</span><span class="str">" </activity>\n"</span><span class="pln">
</span><span class="str">" </application>\n"</span><span class="pln">
</span><span class="str">"</manifest>\n"</span><span class="pun">)</span></code></pre>
<p>And finally, a custom build-step to invoke the ant-build tool on the temporary skeleton project to create the final APK:</p>
<pre class="prettyprint prettyprinted"><code><span class="kwd">if</span><span class="pln"> </span><span class="pun">(</span><span class="str">"${CMAKE_BUILD_TYPE}"</span><span class="pln"> STREQUAL </span><span class="str">"Debug"</span><span class="pun">)</span><span class="pln">
</span><span class="kwd">set</span><span class="pun">(</span><span class="pln">ANT_BUILD_TYPE </span><span class="str">"debug"</span><span class="pun">)</span><span class="pln">
</span><span class="kwd">else</span><span class="pun">()</span><span class="pln">
</span><span class="kwd">set</span><span class="pun">(</span><span class="pln">ANT_BUILD_TYPE </span><span class="str">"release"</span><span class="pun">)</span><span class="pln">
endif</span><span class="pun">()</span><span class="pln">
add_custom_command</span><span class="pun">(</span><span class="pln">TARGET $</span><span class="pun">{</span><span class="pln">target</span><span class="pun">}</span><span class="pln"> POST_BUILD COMMAND $</span><span class="pun">{</span><span class="pln">ANDROID_ANT</span><span class="pun">}</span><span class="pln"> $</span><span class="pun">{</span><span class="pln">ANT_BUILD_TYPE</span><span class="pun">}</span><span class="pln"> WORKING_DIRECTORY $</span><span class="pun">{</span><span class="pln">CMAKE_CURRENT_BINARY_DIR</span><span class="pun">}/</span><span class="pln">android</span><span class="pun">)</span></code></pre>
<p>With all this in place, I can now do a:</p>
<pre class="prettyprint prettyprinted"><code><span class="pun">></span><span class="pln"> </span><span class="pun">./</span><span class="pln">oryol make </span><span class="typ">CoreHello</span><span class="pln"> android</span><span class="pun">-</span><span class="pln">make</span><span class="pun">-</span><span class="pln">debug</span></code></pre>
<p>To compile and package a simple Hello World Android app!</p>
<p>What’s currently missing is a simple wrapper to deploy and run an app on the device:</p>
<pre class="prettyprint prettyprinted"><code><span class="pun">></span><span class="pln"> </span><span class="pun">./</span><span class="pln">oryol deploy </span><span class="typ">CoreHello</span><span class="pln">
</span><span class="pun">></span><span class="pln"> </span><span class="pun">./</span><span class="pln">oryol run </span><span class="typ">CoreHello</span></code></pre>
<p>These would be simple wrappers around the adb tool; later this should of course also work for iOS apps.</p>
<p>Right now the Android build system only works on OSX and only for the ARM V7A instruction set, and there’s no proper Android port of the actual code yet, just a single log message in the CoreHello sample.</p>
<p>Phew, that’s it! All this stuff is also available on github (<a href="https://github.com/floooh/oryol/tree/master/cmake">https://github.com/floooh/oryol/tree/master/cmake</a>).</p>
<blockquote>
<p>Written with <a href="https://stackedit.io/">StackEdit</a>.</p>
</blockquote>Unknownnoreply@blogger.comtag:blogger.com,1999:blog-2948438400037317662.post-66806979082457543232014-02-02T16:13:00.001+01:002014-02-02T16:13:13.633+01:00It's so quiet here...<p>…because I’m doing a lot of weekend coding at the moment. I basically caught the github bug over the holidays:</p>
<p><a href="http://www.github.com/floooh">http://www.github.com/floooh</a></p>
<p>I’ve been playing around with C++11, python, Vagrant, puppet and chef recently:</p>
<p><strong>C++11:</strong></p>
<ul>
<li>I like: move semantics, for (:), variadic template arguments, std::atomic, std::thread, std::chrono, possibly std::function and std::bind (haven’t played around with these yet)</li>
<li>(still) not a big fan of: auto, std containers, exceptions, rtti, shared_ptr, make_shared</li>
<li>thread_local vs __thread vs __declspec(thread) is still a mess across Clang/OSX, GCC and VisualStudio</li>
<li>the recent crazy-talk about integrating a 2D drawing API into the C++ standard gives me the shivers, what a terrible, terrible idea!</li>
</ul>
<p><strong>Python</strong></p>
<ul>
<li>best choice/replacement for command-line scripts and asset tools (all major 3D modelling/animation tools are python-scriptable)</li>
<li>performance of the standard python interpreter is disappointing, and making something complex like FBX SDK work in alternative Python compilers is difficult or impossible</li>
</ul>
<p><strong>Vagrant plus Puppet or Chef</strong></p>
<ul>
<li>Vagrant is extremely cool for having an isolated cross-compilation Linux VM for emscripten and PNaCl: instead of writing a readme with all the steps required to get a working build machine, you can simply check a Vagrantfile into the version-control repository, and other programmers simply do a ‘vagrant up’ and have a VM which ‘just works’</li>
<li>the slow performance of shared directories on VirtualBox requires some silly workarounds; supposedly this is better with VMware Fusion, but I haven’t tried that yet</li>
<li>Puppet vs Chef are like Coke vs Pepsi for such simple “stand-alone” use-cases. Chef seems to be more difficult to get into, but I think in the end it is more rewarding when trying to “scale up”</li>
</ul>
<blockquote>
<p>Written with <a href="https://stackedit.io/">StackEdit</a>.</p>
</blockquote>Unknownnoreply@blogger.comtag:blogger.com,1999:blog-2948438400037317662.post-49607748076379761362013-12-20T15:56:00.001+01:002013-12-20T16:12:24.818+01:00Asset loading in emscripten and PNaCl<p>Loading data from a file on disk doesn’t look like a big deal in a normal C application:</p>
<pre class="prettyprint prettyprinted" style=""><code class="language-c"><span class="typ">int</span><span class="pln"> main</span><span class="pun">()</span><span class="pln"> </span><span class="pun">{</span><span class="pln">
</span><span class="com">// open file for reading</span><span class="pln">
</span><span class="typ">FILE</span><span class="pun">*</span><span class="pln"> fh </span><span class="pun">=</span><span class="pln"> fopen</span><span class="pun">(</span><span class="str">"filename"</span><span class="pun">,</span><span class="pln"> </span><span class="str">"rb"</span><span class="pun">);</span><span class="pln">
</span><span class="kwd">if</span><span class="pln"> </span><span class="pun">(</span><span class="pln">fh</span><span class="pun">)</span><span class="pln"> </span><span class="pun">{</span><span class="pln">
</span><span class="com">// read some bytes</span><span class="pln">
</span><span class="kwd">char</span><span class="pln"> buffer</span><span class="pun">[</span><span class="lit">128</span><span class="pun">];</span><span class="pln">
fread</span><span class="pun">(</span><span class="pln">buffer</span><span class="pun">,</span><span class="pln"> </span><span class="kwd">sizeof</span><span class="pun">(</span><span class="pln">buffer</span><span class="pun">),</span><span class="pln"> </span><span class="lit">1</span><span class="pun">,</span><span class="pln"> fh</span><span class="pun">);</span><span class="pln">
</span><span class="com">// close the file</span><span class="pln">
fclose</span><span class="pun">(</span><span class="pln">fh</span><span class="pun">);</span><span class="pln">
fh </span><span class="pun">=</span><span class="pln"> </span><span class="lit">0</span><span class="pun">;</span><span class="pln">
</span><span class="pun">}</span><span class="pln">
</span><span class="kwd">return</span><span class="pln"> </span><span class="lit">0</span><span class="pun">;</span><span class="pln">
</span><span class="pun">}</span></code></pre>
<p>When doing a real-world game this simple approach has a couple of problems:</p>
<ul>
<li><strong>blocking</strong>: The above code is blocking, when reading from a fast hard disk this is probably not even noticeable, but try loading from a DVD or Bluray disk or some sort of network drive over a slow connection and the game loop will stutter</li>
<li><strong>hard-coded paths</strong>: The concept of a <em>current directory</em> is often not portable; you can’t depend on the current directory being set to where your executable is. It is better to establish an absolute root location and have all filename paths in the game relative to that (of course, how to establish this root location is platform-dependent again; for instance, get the absolute path to the executable and go on from there)</li>
<li><strong>can’t use different transfer protocols</strong>: the above code works fine for local filesystems, but not for loading data from a web or FTP server, and operations like creating a new file, or randomly seeking in a file, may not be available with other protocols.</li>
</ul>
<p>It is a good idea to restrict the type of file operations that a game can use, e.g.:</p>
<ul>
<li><strong>do we really need write and create access?</strong> An offline game may need to write save-game files and options, while an online game probably doesn’t need access to the local file system at all.</li>
<li><strong>do we really need random seek?</strong> Randomly seeking in a file can be either impossible (HTTP) or slow because some mechanical device must be moved around; it’s often better to read a file straight into memory and seek there, or to avoid such operations altogether.</li>
<li><strong>do we really need to iterate directory content?</strong> Again, this can be either expensive (mechanical storage device) or impossible (in plain HTTP, for instance)</li>
<li><strong>do we really need free-form file paths?</strong> Games usually need to access very few places in the file system (the asset directory which is usually read-only, and maybe some sort of per-user writable location for settings and save-games)</li>
<li><strong>do we really need access to file attributes?</strong> Stuff like last modification time, ownership, readable/writable. Usually this is not needed.</li>
<li><strong>do we really need the concept of a “current directory”?</strong> This can be tricky for portability, and some platforms don’t have the concept of a current working directory at all</li>
</ul>
<p>That’s a lot of features we don’t need in a game and which are also often not provided by web-based runtime platforms like PNaCl and JS. It helps to look at the HTTP protocol for inspiration, since that is where we need to load our data from anyway in the web scenario:</p>
<ul>
<li>file system paths become URLs</li>
<li>only one read operation, GET, which usually provides an entire file (but can also load a part of a file)</li>
<li>no directory iteration</li>
<li>no “write access” unless specifically allowed by the server</li>
<li>state-less, no current directory or current read position</li>
<li>operations can take very long (seconds or even minutes)</li>
</ul>
<p>For a game which wants to load its assets from the web, the IO system should be designed around those restrictions.</p>
<p>As an example, here’s an overview of the Nebula3 IO system:</p>
<ul>
<li><strong>all paths are URLs</strong>: Not much to say about this :)</li>
<li><strong>a single root location</strong>: At application start, a root location is established; this is usually a file:// URL pointing to the app’s installation directory, but it can be overridden to point (for instance) to an http:// URL. Loading all data from a web server instead of the local hard disk is done with a single line of code which sets a different root location.</li>
<li><strong>Amiga assigns as path aliases</strong>: A filesystem path to a texture looks like this in N3: <em>tex:walls/brickwall.dds</em>, where the <em>tex:</em> is an “AmigaOS assign” which is replaced with an absolute path, incorporating the root directory.</li>
<li><strong>all paths are absolute</strong>: there is no concept of a “current directory” in Nebula3, instead all paths resolve to an absolute location at runtime by replacing assigns in the path.</li>
<li><strong>pluggable “virtual filesystem” modules associated with the URL scheme</strong>: URLs starting with file:// are handled by a different file system module than http://, plus Nebula3 apps can plug in their own filesystem modules if they want</li>
<li><strong>stream objects, stream readers and stream writers</strong>: this is interesting in the web context only because there’s a MemoryStream object which is used to store and transfer downloaded data in RAM</li>
<li><strong>asynchronous IO is really simple</strong>: more on that later in this post :)</li>
</ul>
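<p>To make the assign mechanism a bit more concrete, here is a small stand-alone sketch of how such path resolution could look. Note that the function name, the assign table and the example URLs are all made up for illustration; this is not the actual Nebula3 API:</p>

```cpp
#include <cassert>
#include <map>
#include <string>

// hypothetical assign table; "root" is the overridable root location,
// and assigns may resolve to other assigns
static std::map<std::string, std::string> assigns = {
    { "root", "http://myserver.com/myapp" },
    { "tex",  "root:textures" },
};

// repeatedly replace the "xyz:" prefix until an absolute URL remains
std::string ResolveAssigns(std::string path) {
    for (;;) {
        auto colon = path.find(':');
        // stop at URL schemes like "http://" (colon followed by "//"),
        // or when there is no assign prefix left
        if ((colon == std::string::npos) || (path.compare(colon, 3, "://") == 0)) {
            return path;
        }
        auto it = assigns.find(path.substr(0, colon));
        if (it == assigns.end()) {
            return path;
        }
        path = it->second + "/" + path.substr(colon + 1);
    }
}
```

<p>With this table, a path like <em>tex:walls/brickwall.dds</em> resolves to an absolute http:// URL in two steps, and switching the whole app to a different data source is just a matter of changing the single "root" entry.</p>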
<p>Since Nebula3 is also used as a command-line-tools framework, the IO subsystem is a bit of a hybrid, which in hindsight was a design fault. There are still all these writing and file creation operations, blocking IO, directory walking etc… which makes the API quite bloated. In a new engine I would probably strictly separate the two scenarios, use the engine as a game framework only, which only supports very simple asynchronous read operations, and write the tools with another framework (or even other language, like python). </p><div class="se-section-delimiter"></div>
<h4 id="asynchronous-io-in-nebula">Asynchronous IO in Nebula3</h4>
<p>Let’s look at async IO in Nebula3 a bit closer since this is the most interesting feature for web-based platforms. This is based on the “non-blocking future” pattern (or whatever you wanna call it) and depends on a frame-driven instead of event- or callback-driven application architecture.</p>
<p>Here’s some pseudo code:</p><div class="se-section-delimiter"></div>
<pre class="prettyprint prettyprinted" style=""><code class="language-cpp"><span class="kwd">void</span><span class="pln"> </span><span class="typ">StartLoading</span><span class="pun">()</span><span class="pln"> </span><span class="pun">{</span><span class="pln">
</span><span class="com">// To start loading data we need to create an </span><span class="pln">
</span><span class="com">// IO request object and "send it off" to the</span><span class="pln">
</span><span class="com">// IoInterface singleton for asynchronous processing</span><span class="pln">
</span><span class="typ">Ptr</span><span class="pun"><</span><span class="pln">IO</span><span class="pun">::</span><span class="typ">ReadStream</span><span class="pun">></span><span class="pln"> req </span><span class="pun">=</span><span class="pln"> IO</span><span class="pun">::</span><span class="typ">ReadStream</span><span class="pun">::</span><span class="typ">Create</span><span class="pun">();</span><span class="pln">
req</span><span class="pun">-></span><span class="typ">SetURI</span><span class="pun">(</span><span class="str">"tex:walls/brickwall.dds"</span><span class="pun">);</span><span class="pln">
</span><span class="typ">IoInterface</span><span class="pun">::</span><span class="typ">Singleton</span><span class="pun">()-></span><span class="typ">Send</span><span class="pun">(</span><span class="pln">req</span><span class="pun">);</span><span class="pln">
</span><span class="com">// The IoRequest is now "in flight" and will contain</span><span class="pln">
</span><span class="com">// a result at some point in the future. Because we need</span><span class="pln">
</span><span class="com">// to check for completion in some later frame we need to</span><span class="pln">
</span><span class="com">// store the smart pointer somewhere</span><span class="pln">
</span><span class="kwd">this</span><span class="pun">-></span><span class="pln">pendingRequest </span><span class="pun">=</span><span class="pln"> req</span><span class="pun">;</span><span class="pln">
</span><span class="com">// ok, we're done for this frame...</span><span class="pln">
</span><span class="pun">}</span><span class="pln">
</span><span class="kwd">void</span><span class="pln"> </span><span class="typ">HandlePendingRequest</span><span class="pun">()</span><span class="pln"> </span><span class="pun">{</span><span class="pln">
</span><span class="com">// this function must be called regularly (e.g. per</span><span class="pln">
</span><span class="com">// frame) to check whether the async loading operation</span><span class="pln">
</span><span class="com">// has finished</span><span class="pln">
</span><span class="kwd">if</span><span class="pln"> </span><span class="pun">(</span><span class="kwd">this</span><span class="pun">-></span><span class="pln">pendingRequest</span><span class="pun">.</span><span class="pln">isvalid</span><span class="pun">()</span><span class="pln"> </span><span class="pun">&&</span><span class="pln">
</span><span class="kwd">this</span><span class="pun">-></span><span class="pln">pendingRequest</span><span class="pun">-></span><span class="typ">Handled</span><span class="pun">())</span><span class="pln"> </span><span class="pun">{</span><span class="pln">
</span><span class="com">// ok, the request has been completed, if </span><span class="pln">
</span><span class="com">// the file was loaded successfully we get</span><span class="pln">
</span><span class="com">// a MemoryStream object with its content</span><span class="pln">
</span><span class="kwd">if</span><span class="pln"> </span><span class="pun">(</span><span class="kwd">this</span><span class="pun">-></span><span class="pln">pendingRequest</span><span class="pun">-></span><span class="typ">GetSuccess</span><span class="pun">())</span><span class="pln"> </span><span class="pun">{</span><span class="pln">
</span><span class="com">// actually load the data from the memory</span><span class="pln">
</span><span class="com">// stream and throw the request object away,</span><span class="pln">
</span><span class="com">// since all file data is in memory, we can</span><span class="pln">
</span><span class="com">// actually use the normal open/seek/read/close</span><span class="pln">
</span><span class="com">// pattern on the stream object</span><span class="pln">
</span><span class="kwd">this</span><span class="pun">-></span><span class="typ">LoadFromStream</span><span class="pun">(</span><span class="kwd">this</span><span class="pun">-></span><span class="pln">pendingRequest</span><span class="pun">-></span><span class="typ">GetStream</span><span class="pun">());</span><span class="pln">
</span><span class="com">// delete the request object, </span><span class="pln">
</span><span class="com">// remember, this is a smart pointer :)</span><span class="pln">
</span><span class="kwd">this</span><span class="pun">-></span><span class="pln">pendingRequest </span><span class="pun">=</span><span class="pln"> </span><span class="lit">0</span><span class="pun">;</span><span class="pln">
</span><span class="pun">}</span><span class="pln">
</span><span class="pun">}</span><span class="pln">
</span><span class="pun">}</span></code></pre>
<p>There may be less verbose or more elegant versions of this code of course, but the basic idea is that you start loading a file in one frame, and then need to check in the following frames if loading has finished (or failed), and get the completely loaded data in a memory buffer which can be parsed with “traditional” read and seek functions (and which is very fast since everything happens in memory).</p>
<p>This implies that the engine needs to know what to do while some required data has not been loaded yet. For a graphics pipeline this is quite simple by either rendering nothing or some placeholder while the data is still loading.</p>
<p>But there are cases where the code cannot progress without important data being loaded, or where it would be very tricky or impossible to implement asynchronous IO (for instance when integrating complex 3rd party libraries like sqlite).</p>
<p>If we could simply block, this wouldn’t be a problem: the worst thing that would happen is that our game loop would stutter. But on web platforms we cannot simply block the main thread (it is easier on PNaCl, where it is recommended to move the game loop into a separate thread, which can then block waiting for the main thread to process asynchronous IO requests).</p>
<p>For Nebula3 I fixed this with an additional application object state called the “Preloading Phase”. The idea is that the app enters this state outside of the normal game loop (for instance while displaying a loading screen), and during this state, populates a simple in-memory filesystem (basically just a lookup-table with URLs as keys and MemoryStream objects as values) with the asynchronously loaded data. When all data has been loaded (or failed to load), the app will leave the preloading phase (and hide the loading screen) and synchronous loader code will transparently get the data from the in-memory file system instead of starting an actual asynchronous IO request. Since all this preloaded data resides in memory this means of course that only small data and few files should be preloaded, and the majority of data should be asynchronously streamed on demand during the game loop. It’s really only a workaround for the few cases where synchronous access is absolutely necessary.</p>
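<p>The core of such a "preloading phase" file system is really just a lookup table; here is a minimal sketch (the class and method names are invented for illustration, the real Nebula3 code looks different):</p>

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch of the in-memory filesystem described above: URLs as keys,
// fully downloaded file content as values.
class PreloadFileSystem {
public:
    // called from the async IO completion handler while the
    // loading screen is displayed
    void Add(const std::string& url, std::vector<uint8_t> data) {
        this->files[url] = std::move(data);
    }
    // synchronous loader code transparently checks this table first,
    // instead of firing an actual asynchronous IO request
    const std::vector<uint8_t>* Lookup(const std::string& url) const {
        auto it = this->files.find(url);
        return (it != this->files.end()) ? &it->second : nullptr;
    }
private:
    std::unordered_map<std::string, std::vector<uint8_t>> files;
};
```

<p>Since everything lives in RAM, lookups are cheap, but this also makes it obvious why only small and few files should go through this path.</p>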
<p>More details about this in one of my presentations: <a href="http://www.slideshare.net/andreweissflog3/gdce2013-cpp-ontheweb">http://www.slideshare.net/andreweissflog3/gdce2013-cpp-ontheweb</a></p><div class="se-section-delimiter"></div>
<h4 id="emscripten-and-pnacl-details">emscripten and PNaCl details</h4>
<p>Ok, almost done!</p>
<p>For the emscripten and PNaCl platforms I basically wrote a simple Nebula3 filesystem module which fires HTTP GET requests through the respective emscripten and PNaCl API calls, and copies the received data into MemoryStream objects; it’s only a few hundred lines of code each. </p>
<p>The main difference between the two platforms lies in the use of threading:</p>
<ul>
<li>PNaCl works like “traditional” platforms, there are a number of IO threads (about 10, but that’s tweakable) each of them processes one IO request at a time, so that as many IO requests can be in flight as there are IO threads. Those threads also directly handle processing of the received data like decompression.</li>
<li>In emscripten, the IO calls (sending an HTTP request, and the callback when the response has been received) are handled on the main thread, but the expensive processing (e.g. decompression) of the received data is handed over to a WebWorker pool (usually 4 WebWorker threads). There can still be multiple IO requests in flight because the IO system doesn’t “wait” for an IO request to finish before firing a new one (but it is still throttled to restrict the number of requests in flight in case a lot of requests arrive in a very short time period).</li>
</ul>
<p>The actual code implementation is straightforward so I’ll spare you the source code samples. The respective class in PNaCl is called <strong>pp::URLLoader</strong>, and emscripten offers a whole set of rather specialized C functions which all start with <strong>emscripten_async_wget</strong>. Both fire an HTTP request (emscripten does an XmlHttpRequest, and PNaCl presumably under the hood as well - this has some unfortunate cross-domain implications), and invoke callbacks on failure or when data has arrived. PNaCl needs a bit more coding work since data is received in chunks (and the receive callback can be called multiple times), while emscripten waits until all data is received before calling the received-callback once.</p>
<p>emscripten has more options to integrate the data with the web page DOM (for instance it can automatically create DOM image objects from downloaded image files), and it also has a very advanced CRT IO emulation layer (so you actually <em>can</em> directly use fopen/fclose after the data has been downloaded or preloaded), but I haven’t looked into these advanced concepts very closely since Nebula3 already does a lot of this layering itself.</p>
<p>There’s a similar filesystem layer for NaCl called nacl-mounts, but similarly to emscripten I didn’t look into this very closely since the low-level URL loading functions were a better fit for N3.</p>
<p>That’s it for today, have a nice Christmas everyone :)</p>
<blockquote>
<p>Written with <a href="https://stackedit.io/">StackEdit</a>.</p>
</blockquote>Unknownnoreply@blogger.comtag:blogger.com,1999:blog-2948438400037317662.post-33298515762959551122013-11-03T16:11:00.001+01:002013-11-03T16:11:44.717+01:00Messing around with MESS (and JSMESS)<p>And now for something completely different:</p>
<p>Since I’m dabbling with emscripten I’ve had this idea in my head to write or port a <a href="http://en.wikipedia.org/wiki/KC85">KC85/3</a> emulator, so that I could play the games I wrote as a kid directly in the browser. The existing KC85 emulators I was aware of are not trivial to port; they either depend on x86 inline assembly, or are hardwired to a specific UI framework (if you read German, here’s an overview of what’s out there: <a href="http://www.kc85emu.de/Emulatoren/Emulatoren.htm">http://www.kc85emu.de/Emulatoren/Emulatoren.htm</a> )</p>
<p>About 2 weeks ago I started to look around more seriously for a little side project to spend my 3 weeks of vacation around Christmas on (I need to burn my remaining vacation days; in Germany employees are basically required by law to take all their vacation - tough shit ;) My original plan was to cobble together a minimal emulator, just enough to run my old games: take an existing Z80 CPU emulator like the one from <a href="http://fuse-emulator.sourceforge.net/">FUSE</a>, hack some keyboard input and video output and go on from there.</p>
<p>Thankfully I then had a closer look at <a href="http://www.mess.org/">MESS</a>. I always thought that MESS could only emulate the most popular Western game machines like the C64 or Atari 400, but it turns out that this beast can emulate pretty much any computer that ever existed (between 600 and 1700, depending on how you count), it even has support for the PDP-1 from the early 60’s! When searching through the list of emulated systems here (<a href="http://www.progettoemma.net/mess/sysset.php">http://www.progettoemma.net/mess/sysset.php</a>) I stumbled over the following entries: </p>
<ul>
<li>HC900 / KC 85/2</li>
<li>KC 85/3</li>
<li>KC 85/4</li>
<li>KC Compact</li>
<li>Lerncomputer LC 80</li>
<li>KC 85/1</li>
<li>Z1013</li>
<li>Poly-Computer 880</li>
<li>BIC A5105</li>
</ul>
<p>That’s the entire list of East-German “hobby computers”. But wait, there’s more:</p>
<ul>
<li>Robotron PC-1715</li>
<li>A5120</li>
<li>A7150</li>
</ul>
<p>These were GDR office computers. The 1715 was a CP/M-compatible 8-bit PC, and the A7150 was a not-quite-compatible x86 IBM-PC clone. I’m actually not sure what the 5120 was, just that it was a big ugly box with a built-in monochrome monitor.</p>
<p>Since all those systems are marked as “not working” in this list I wasn’t too enthusiastic yet, but I had to be sure. The latest MESS compiled out of the box on OSX, and it was easy to find the right ROM images on the net. So I started MESS with:</p>
<blockquote>
<p>./mess64 kc85_3 -window</p>
</blockquote>
<p>To my astonishment I watched a complete boot sequence into the operating system:</p>
<p><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGONfjBpsJEfy6vF2D3cyGIgb5aiw6cQ8_KQFb2Z0ioEb9DihmgTOEXCkUPUf2pmVuKrm5WHSDeR3LyVSLCDR9KZR2zoS39BDLcB80pUI8JOgRoGm7umYpZgjWXEq4fqOzWtZTciPecKMg/s640/kc85_3.png" alt="KC85/3 system shell" title="kc85_3.png"></p>
<p>Excite!</p>
<p>I had also come across the <a href="https://github.com/jsmess/jsmess">JSMESS</a> project before, which is a port of MESS to Javascript using emscripten. So my next step was to compile JSMESS and see whether the KC emulator works there as well. It booted, but didn’t accept any keyboard input :( After comparing the source code it dawned on me that JSMESS was far behind MESS, about 2 years to be exact. But this was a good excuse to dive a bit deeper into how MESS actually works, and the deeper I crawled the more impressed I became.</p>
<p>MESS had been derived from the well-known MAME arcade machine emulator project, with the goal of extending the emulation to “real computers”. Later MESS merged with MAME again, so that today both projects compile from the same code base. </p>
<p>A specific emulated machine is called a “system driver” and can be described by just a few lines of code listing what CPU to use, the RAM and ROM areas, what ROM image to load, and what memory-mapped IO registers exist. You’ll also have to provide several callback routines for handling reads and write to IO addresses and to convert the system’s video memory into a standardized bitmap representation. For a very simple computer built from standard chips a working emulator can be plugged together in a couple of hours, but writing a complete and “cycle-perfect” emulator is of course still a tough challenge, especially if custom chips are used. The overall amount of research and implementation work that went into MESS is almost overwhelming. Pretty much every computer, every mass-produced chip that ever existed is emulated in there, often with all of their undocumented quirks!</p>
<p>Ok, back to the KC85/3: after analyzing the source code of the KC driver it quickly became clear that the keyboard input emulation was the toughest part, since this was where the original KC engineers were very “creative”. As far as I understood the several pages of email exchange which are included as a comment in the MESS KC driver, the KC keyboard used a very exotic TV remote control chip to send serial pulses to the main unit (the KC had an external keyboard connected with a “very thin” wire, so it was very likely a simple serial connection). The base unit which received the signal didn’t have a “decoder chip” however, but used its universal Z80-CTC (timer) and -PIO (in/out) chips to decode the signal. Emulating this behaviour seems to be very tricky, since a lot of KC emulators have janky keyboard input (not registering key presses, or inserting random key codes when typing fast, etc…). </p>
<p>Since I didn’t get this to work reliably even after back-porting the latest keyboard input code from MESS (which somewhat works, but still has problems with random keys triggering), I decided to be a bit naughty and implement a shortcut (the “cycle-perfect” emulator purists will likely kill me for this heresy):</p>
<p>After the KC-ROM reads a keyboard scan-code through this tricky serial-pulse decoding described above, it converts the scan code to ASCII and writes it to memory location 0x1FD, and then sets bit 0 in memory location 0x1F8 to signal that a new key code is available. It also maintains a keyboard repeat counter at address 0x1FA. All of this can be gathered from the keyboard handling code in ROM (and is also explained in that very informative, very long comment in the source code). I’m basically “shortcutting” this with C code which writes the ASCII code directly to 0x1FD and also handles the key repeat directly in C. The tricky serial decoding stuff in ROM is never triggered this way. With this hack the keyboard input is fairly responsive (sometimes the first key is swallowed, don’t know yet what’s up with this).</p>
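<p>Against a fake 64 kByte memory array (standing in for the real MESS memory interface, which works differently), the core of this hack boils down to something like the following; the function name is made up, only the addresses are the ones from the KC85/3 ROM described above:</p>

```cpp
#include <cassert>
#include <cstdint>

// fake 64 kByte address space standing in for the emulated KC85/3 RAM
static uint8_t mem[0x10000];

// shortcut keyboard input: bypass the serial-pulse decoding in ROM and
// poke the key code directly into the ROM's keyboard state variables
void PokeKey(uint8_t ascii) {
    mem[0x1FD] = ascii;    // ASCII code of the new key
    mem[0x1FA] = 0;        // keyboard repeat counter (repeat handling
                           // is glossed over in this sketch)
    mem[0x1F8] |= 0x01;    // bit 0: "new key code available"
}
```

<p>The ROM’s own keyboard routines then pick the key code up from 0x1FD exactly as if the serial decoding had happened.</p>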
<p>Next I had to fix the RGB colors which were off both in MESS and JSMESS (bright yellow-green looked more like puke-yellowish, and all other “inbetween colors” were off too), and I finally back-ported (and also optimized a bit) the video memory mapping code from MESS to JSMESS.</p>
<p>You can check all my changes here on GitHub: <a href="https://github.com/floooh/jsmess/tree/floh">https://github.com/floooh/jsmess/tree/floh</a> Right now a “reboot” is going on in the JSMESS project to bring it up to date with the latest MESS version. I’ll wait with any pull requests until this is finished and I’ve refreshed my own fork as well. Also, I will not try to contribute my “dirty hacks” back to the main code base of course; the MESS guys are right to insist on perfect emulation instead of shortcut hacks like the keyboard hack described above. But my (rather egoistic) main goal is to get my own games running on my web page, so I think I can get away with such hacks in my own fork.</p>
<p>The next challenge is to get all of my games running in JSMESS. This is harder than I thought. Part of the problem is that there exist several memory dump files which are not original. I found dump files with the wrong entry address, and dumps where others have implemented cheats and trainers. So far I’ve got 3 out of 7 games working. Getting the remaining 4 games into working condition might take a while since I may have to do some hardcore assembly debugging to find out what’s wrong.</p>
<p>Thankfully MESS has a complete assembler-level debugger built in:</p>
<p><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvpjKuMcjsjUIz5TfgHzYArEWgk7FsE2gpu1EDstue7x6vNE0sdMYckduTo8SX-QwNXZLwJzRBWzLPj7uZ-Ttluv6x60macILNXwZdWbdNYoVIeBCC-JADJYSaesBPr3ySBL3oU9b5IskP/s640/mess_debugger.png" alt="MESS Debugger" title="mess_debugger.png"></p>
<p>Reconstructing the program flow of this 25-year-old game which I wrote in machine code (instead of using an assembler) is actually quite a lot of fun, much easier than trying to reconstruct a program which was written in a high-level language and compiled to machine code. Subroutines often start at “even” addresses and have a block of NOP instructions appended, in case I needed to add instructions when fixing bugs; strings are usually embedded right into the instruction sequence instead of a central “string pool”. Analyzing the program flow comes down to figuring out what a given subroutine does (drawing a sprite? handling keyboard input? updating the hiscore display?), and what variables are stored at specific memory addresses (for instance the current life counter, the current position, and so on).</p>
<p>What’s remarkable is how small the game code actually is, even though it is not very dense, with all those NOPs in between and a lot of redundant code segments (i.e. I didn’t particularly care about code size). Of the roughly 12 kByte of my (very simple) Pacman clone, only about 3.5 kByte are actual code. The entire game code fits on a single screen (marked in yellow here):</p>
<p><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZm4otadbjX48LytE-ItEpdXSL5bf4a8FHNcyalwwTSAUundWjcS02XtN2q2tbS2JJSlzVzK41W7ZnYFnmpE8nhE2-94qEER7ZjaMahyl24Jp-J0bfvZgAyFkmGiGYQuL_OPdFSbT7uVX3/s640/pacman_dump.png" alt="enter image description here" title="pacman_dump.png"></p>
<p>Finally, here’s the current result of this work: a JSMESS KC85/3 and KC85/4 emulator, and 3 of my old games running directly in the browser. Don’t try this on an iPhone though (or in Safari in general). Firefox or an up-to-date Chrome works very well:</p>
<p><a href="http://www.flohofwoe.net/history.html">http://www.flohofwoe.net/history.html</a></p>
<blockquote>
<p>Written with <a href="https://stackedit.io/">StackEdit</a>.</p>
</blockquote>Unknownnoreply@blogger.comtag:blogger.com,1999:blog-2948438400037317662.post-62512576329721979822013-10-08T21:04:00.001+01:002013-10-08T21:17:01.844+01:00Farewell DirectX<p>Today I ported the OpenGL rendering code in Nebula3's bleeding edge branch back to Windows:</p>
<p><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg12m4L1VLT6WvceJg3m8Gi1jZEwj0N4HAT5BNfsz645xgp3lx9t_RX3d7OCGwPHwh6KvciF1REHXmsasl8dLd9sjqXYKf2wvrbNPbDM1mTQVnzkXoDDnsn4zdPCIzZvJdBkbKXLYlTdks4/s480/cg2_win32.png" alt="enter image description here" title="Windows version of N3 running in OpenGL"></p>
<p>This is remarkable in 2 ways:</p>
<ol>
<li>It's the first time since around 1997 that I ported a significant amount of code <em>to</em> Windows. Usually it was from Windows towards another platform.</li>
<li>This is also the end of DirectX in our code base (well almost, we're still using the xnamath.h header, which is a standalone header and doesn't require linking against any DX DLL).</li>
</ol>
<p>Why do I think this is remarkable?</p>
<p>It is the end of an era! In 1997 I ported Urban Assault from DOS to Windows95 and Direct3D5. This was just around the time when Windows started its career as a gaming platform. D3D5 was the first D3D version which didn't completely suck, because it had the new DrawPrimitive API; before that, rendering commands had to be issued through an incredibly arcane "execute buffer" concept (theoretically a good idea if GPUs had been able to directly parse this buffer, but terrible to use in real-world code). The Urban Assault port to D3D was pretty inefficient since we ported from a software rasterizer (with perspective correction and all that cool shit), and if I remember correctly we issued every single triangle through its own DrawPrimitive call (although that wasn't such a big deal at the time). And the only graphics card which had somewhat acceptable D3D support was the RIVA128 from an underdog company called nVidia (this was before their breakthrough with the TNT2), while the top dog was the 3dfx Voodoo2, which had much better support for Glide than for D3D. But since UA was published by Microsoft we had to be D3D-exclusive of course.</p>
<p>From 1998 on, Direct3D was our primary rendering API. I dabbled with OpenGL from time to time, but nothing serious. We made the jump to D3D7, D3D8, and finally D3D9. Each new version sucked less and less, and D3D9 is still a really good API. We never made the jump to D3D10 because of Microsoft's exceptionally stupid decision not to back-port D3D10 from Vista to Windows XP, and since Nebula was never about high-end rendering features but about running on a broad range of old hardware, we could never justify adding D3D10 support, since we couldn't give up D3D9.</p>
<p>And as silly as it sounds, this boneheaded Microsoft decision from 7 years ago is one important reason why I'm ditching D3D today. World-wide, WindowsXP is the <em>fastest growing</em> Windows version. It's growing a lot faster than Windows8. Don't believe me? See the Unity hardware stats page for a scary reality check:</p>
<p><a href="http://stats.unity3d.com/web/index.html">http://stats.unity3d.com/web/index.html</a></p>
<p>The Chinese Dragon has awoken, and it is running exclusively on XP. WindowsXP is also very popular in Eastern Europe and the Middle East. So if you want to support markets east and south of Middle Europe you're basically fucked if you don't support XP.</p>
<p>Another important reason is streamlining the code base. The currently "interesting platforms" (browser and mobile) are all running some variant of POSIX+OpenGL. In this new world the Windows APIs are the exotics, and Microsoft doesn't exactly help the cause by repeating their errors of the past (limiting Windows Store apps to D3D11). By using a single rendering code base (and especially shader code base!) across all platforms we're reducing our technical debt in the future.</p>
<p>I have a fallback plan of course, because there are a few risks:</p>
<ul>
<li>What if OpenGL driver quality on Windows is as bad as everybody says?</li>
<li>What if we need to support native Windows Store apps (as opposed to a WebGL version running embedded in a browser)?</li>
</ul>
<p>The fallback plan has 2 stages:</p>
<ol>
<li>Use <a href="https://code.google.com/p/angleproject">ANGLE</a>, which layers OpenGL ES2 (with some important extensions) over D3D9 or D3D11; this is the preferred solution, since we wouldn't need to touch the render layer code and shader library.</li>
<li>If ANGLE isn't good enough, write native D3D9 and D3D11 ports of the CoreGraphics2 subsystem, and ideally use some API-agnostic shader language wrapper. This wouldn't be as bad as it sounds: each wrapper would be around 7k lines of code, which is about 4.5% of Nebula3 in its minimal useful configuration (about 150k lines of code; depending on which other N3 modules are added this can go up to 500k lines of code).</li>
</ol>
<p>OpenGL isn't perfect of course. It has some incredibly crufty corners; most of those have been fixed through extensions and newer GL versions over time, but realistically we can't use anything newer than OpenGL ES2 with very few extensions for the renderer's base feature set.</p>
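<p>On such an ES2-level baseline, anything beyond the base feature set has to be detected at runtime via the extension string. Here is a minimal sketch of the classic token check (in real code the string would come from glGetString(GL_EXTENSIONS); it is passed in as a parameter here so the helper stays GL-free):</p>

```cpp
#include <cstring>

// Sketch of the classic ES2-style extension check: scan the
// space-separated extension string for an exact token match, so that
// e.g. "GL_EXT_foo" does not falsely match "GL_EXT_foobar".
bool HasExtension(const char* extensions, const char* name) {
    const size_t len = std::strlen(name);
    const char* p = extensions;
    while ((p = std::strstr(p, name)) != nullptr) {
        const bool startOk = (p == extensions) || (p[-1] == ' ');
        const char end = p[len];
        if (startOk && (end == ' ' || end == '\0')) {
            return true;
        }
        p += len;
    }
    return false;
}
```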
<p>When I removed the DirectX library stubs from the Nebula3 CMake files this afternoon I really had to stop and think for a moment. Who knows, maybe in a future blog post in about 15 years I will write "this was around the time when Windows became irrelevant as a gaming platform"? ;)</p>
<blockquote>
<p>Written with <a href="http://benweet.github.io/stackedit/">StackEdit</a>.</p>
</blockquote>Unknownnoreply@blogger.comtag:blogger.com,1999:blog-2948438400037317662.post-88412039050393918422013-09-07T16:12:00.001+01:002013-09-07T16:16:16.306+01:00emscripten and PNaCl: App entry in PNaCl<p>The is the followup to last week's post about <a href="http://flohofwoe.blogspot.de/2013/09/emscripten-and-pnacl-app-entry-in.html">application entry in emscripten</a>. If you haven't done yet I would recommend reading this first before continuing.</p>
<p>2 main points to keep in mind about the (P)NaCl platform:</p>
<ol>
<li>Blocking the main thread will block the entire browser tab.</li>
<li>NaCl has true threading support which can be used to work around these blocking limitations.</li>
</ol>
<p>Point (1) is the same as on the emscripten platform, and point (2) is the big difference to emscripten.</p>
<p>In a Nebula3/PNaCl application, the main function looks the same as on any other platform (I'm using emscripten's "simulate_infinite_loop" approach now):</p>
<pre><code>#include "myapplication.h"
ImplementNebulaApplication();
void
NebulaMain(const Util::CommandLineArgs& args)
{
MyApplication app;
app.SetCommandLineArgs(args);
app.StartMainLoop();
}
</code></pre>
<p>However under the hood, the startup process until the NebulaMain() function is entered is completely different from other platforms, since <strong>PNaCl doesn't have a main() function</strong>. Instead PNaCl has the concept of application <strong>Module</strong> and <strong>Instance</strong> objects. This is where the plugin-nature of a PNaCl app shines through. There is a single Module object created on a web page containing a PNaCl app, and for each <code><embed></code> element on the page, one Instance object. In reality though, most of the time there will be exactly one Module and one Instance object, so the distinction doesn't really matter.</p>
<p>PNaCl offers two different startup APIs for C and C++. The C++ API is easier to grasp IMHO, so I'll concentrate on it here (this dual C/C++ nature continues through the whole NaCl API: there's a pure C API, extended by a slightly higher-level C++ API).</p>
<p>Hooking up your code to NaCl basically means to write 2 subclasses, one deriving from <strong>pp::Module</strong>, and one deriving from <strong>pp::Instance</strong>, and the NaCl runtime will then call into these classes through virtual methods for initialisation and notifying the application about events.</p>
<p>But first things first: </p>
<p>Everything starts at a global function called <strong>pp::CreateModule()</strong> which you must provide, and which must return a new object of your pp::Module subclass (called <strong>N3NaclModule</strong> in this case):</p>
<pre><code>namespace pp
{
Module* CreateModule()
{
return new N3NaclModule();
};
}
</code></pre>
<p>Although this is the very first function that NaCl will call, you should be aware that initialisers in the global scope (static objects) will already be initialised and have had their constructors called at this point.</p>
<p>The main job of the derived Module class is to create Instance objects, but we can also put some one-time init code in there. There's a pair of functions to initialise and shutdown GL rendering called <strong>glInitializePPAPI()</strong> and <strong>glTerminatePPAPI()</strong>. The only rule is that no GL calls must be made outside these two functions, so I guess we could also put them somewhere else, as long as it is guaranteed that they are not called multiple times.</p>
<p>But - the most important method in the derived Module class is the factory method for Instance objects called <strong>CreateInstance</strong>. In my case, I have created a subclass of pp::Instance called <strong>NACL::NACLBridge</strong>.</p>
<p>The entire N3NaclModule class looks like this:</p>
<pre><code>class N3NaclModule : public pp::Module
{
public:
virtual ~N3NaclModule()
{
glTerminatePPAPI();
}
virtual bool Init()
{
return glInitializePPAPI(get_browser_interface()) == 1;
}
virtual pp::Instance* CreateInstance(PP_Instance instance)
{
return new NACL::NACLBridge(instance);
};
};
</code></pre>
<p>All the really interesting stuff from here on happens in the NACLBridge object.</p>
<p>These two source snippets live inside the ImplementNebulaApplication() macro which all in all looks like this:</p>
<pre><code>...
#elif __NACL__
#define ImplementNebulaApplication() \
class N3NaclModule : public pp::Module \
{ \
public: \
virtual ~N3NaclModule() \
{ \
glTerminatePPAPI(); \
} \
virtual bool Init() \
{ \
return glInitializePPAPI(get_browser_interface()) == 1; \
} \
virtual pp::Instance* CreateInstance(PP_Instance instance) \
{ \
return new NACL::NACLBridge(instance); \
}; \
}; \
namespace pp \
{ \
Module* CreateModule() \
{ \
return new N3NaclModule(); \
}; \
}
#elif __MACOS__
...
</code></pre>
<p>Now on to the NACLBridge class. This is (I know I'm repeating myself) derived from the pp::Instance class, but it is called "Bridge" for a reason: on PNaCl we spawn a dedicated thread for the game loop and leave the main thread (aka the Pepper thread) for event handling and rendering. Our derived pp::Instance subclass serves as a "bridge" between these 2 threads, which is why it's called <strong>NACLBridge</strong>.</p>
<p>The NaCl runtime will call into virtual methods of an pp::Instance object for handling events, the most important of these are <strong>Init(), DidChangeView(), HandleInputEvent()</strong>. For a complete overview and exhaustive documentation of those callback methods I recommend sifting directly through the SDK header: <strong>include/ppapi/cpp/instance.h</strong></p>
<p>In the Init() method I'm only building a CommandLineArgs object from the provided raw arguments (these have been extracted from our <code><embed></code> element in the HTML page).</p>
<p>The actual initialisation work happens (in my case) in the first call to <strong>DidChangeView()</strong>, by calling a Setup() method in the NACLBridge object. I chose this place because this is where I get the current display dimensions of the <code><embed></code> element, which are required for the renderer initialisation (although, now that I think about it, I might also be able to extract these from the arguments provided in the Init() method; I need to try this out some time).</p>
<p>The <strong>NACLBridge::Setup()</strong> method only does one thing: create a thread with the <strong>NebulaMain()</strong> function as entry point, and then return to the NaCl runtime. The code inside NebulaMain() works just as on any other platform, with the only difference that it is not running on the main thread, but in its own dedicated game thread.</p>
<p>The big advantage of running the game loop in its own thread is that you "own the game loop", and you can perform blocking operations, for instance waiting for IO. The disadvantage is that you can't call any PPAPI (NaCl system) functions from the game thread, which is a blog-post-topic on its own.</p>
<p>So to recap: The <strong>ImplementNebulaApplication</strong> macro runs on the main thread, and creates one <strong>pp::Module</strong> and one <strong>pp::Instance</strong> object. The <strong>pp::Instance</strong> object creates the dedicated game thread, which calls into the <strong>NebulaMain()</strong> function, which from that moment on runs the game loop like on any other platform. With this approach we don't need to slice the game loop into frames like on the emscripten platform.</p>
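<p>To make the recap concrete, here is a minimal sketch of that spawn-a-game-thread idea in plain C++ (all names are hypothetical; the real NACLBridge also forwards Pepper events, passes the command line args, and of course never joins the game thread):</p>

```cpp
#include <atomic>
#include <thread>

// Hedged sketch of the bridge pattern: the (Pepper) main thread spawns
// a dedicated game thread and returns to the runtime; the game thread
// runs the "infinite" game loop (NebulaMain() in the real port).
static std::atomic<int> framesRun{0};

static void GameThreadEntry() {
    // stands in for NebulaMain(args); here we just simulate 3 frames
    for (int i = 0; i < 3; ++i) {
        ++framesRun;
    }
}

int SpawnGameThreadAndWait() {
    std::thread gameThread(GameThreadEntry);
    // the real bridge would return to the Pepper thread immediately and
    // let the game thread live on; joining only keeps this demo self-contained
    gameThread.join();
    return framesRun.load();
}
```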
<p>Now that you heroically worked your way through all of this I'll tell you a secret: NaCl also provides a simple alternative to this complicated mess called the <strong>ppapi_simple</strong> library, which essentially provides a classic main() function running in its own thread, and because blocking is allowed on this thread, it also provides normal POSIX fopen()/fclose() style blocking IO functions (sound familiar?).</p>
<p>Check out the header file <strong>include/ppapi_simple/ps.h</strong> as starting point.</p>
<p>Unfortunately this ppapi_simple library didn't exist when I started dabbling with NaCl about 2 years ago; it certainly would have made life a lot easier. On the other hand, the work that had already gone into the NaCl port made the emscripten port easier, which wouldn't have been the case had I used the ppapi_simple wrapper code.</p>
<blockquote>
<p>Written with <a href="http://benweet.github.io/stackedit/">StackEdit</a>.</p>
</blockquote>Unknownnoreply@blogger.comtag:blogger.com,1999:blog-2948438400037317662.post-43457172902203294822013-09-01T14:21:00.001+01:002013-09-01T14:32:04.925+01:00emscripten and PNaCl: App entry in emscripten<p>When quickly hacking a graphics demo on the PC or consoles, the main function usually looks like this:</p>
<pre><code>int main()
{
if (Initialize())
{
while (!Finished())
{
Update();
Render();
}
Cleanup();
}
return 0;
}
</code></pre>
<p>Trying this on one of the browser platforms like emscripten or PNaCl results in a freeze, and after a little while the browser will kill your tab :(</p>
<p>The problem is that the browser won't "let you own the game loop"; this is a general problem of event- or callback-driven platforms (iOS and Android have the same problem, for instance). On such platforms the execution flow of the main thread is not controlled by your game code; instead there's some outer event loop which calls into your code from time to time. If you spend too much time in your allotted slice of the pie you will drag the entire system event loop down, and other important events (such as input events) can't be handled fast enough. The result is that the entire user interface feels sluggish and unresponsive (for instance, scrolling in your browser tab will stutter or even freeze for multiple seconds). And if you don't return for about 30 seconds, the browser will kill your app (Aw Snap!).</p>
<p>This is all bad user experience of course, we want the browser to remain responsive, and scrolling smooth <strong>all</strong> the time, also during initialisation and load time.</p>
<p>The core problem is that your code <strong>must always</strong> return to the browser within a few milliseconds (e.g. 16 or 33, depending on whether you're aiming for 60 or 30 fps), and this is the big riddle we need to solve for a game application running in a browser.</p>
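<p>One common way to stay inside such a frame budget is to chop long-running work into small tasks and only execute as many per frame as the budget allows. A hedged sketch of this idea (invented names, not actual Nebula3 code):</p>

```cpp
#include <chrono>
#include <deque>
#include <functional>

// Hypothetical sketch of budgeted per-frame work: run queued tasks
// until the frame's time budget is used up, then return control to
// the browser's event loop and continue next frame.
class WorkQueue {
public:
    void Push(std::function<void()> task) {
        tasks.push_back(std::move(task));
    }
    // returns the number of tasks executed this frame
    int RunSlice(std::chrono::microseconds budget) {
        using Clock = std::chrono::steady_clock;
        const auto start = Clock::now();
        int executed = 0;
        while (!tasks.empty() && (Clock::now() - start) < budget) {
            tasks.front()();     // do one small piece of work
            tasks.pop_front();
            ++executed;
        }
        return executed;         // remaining tasks wait for the next frame
    }
    bool Empty() const { return tasks.empty(); }
private:
    std::deque<std::function<void()>> tasks;
};
```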
<p>For a Flash or Javascript coder, or someone who mainly writes event-driven UI applications, this will all be familiar: they are used to having all their code run inside event handlers and callbacks. But typical UI apps usually don't need to do anything continuously. Event-driven applications sleep most of the time, react to (mostly input) events from the outside, and go to sleep again. Games, however, need to do continuous rendering, and thus are <strong>frame-driven</strong>, not <strong>event-driven</strong>, and mixing these two programming models isn't a very good idea because it's hard to follow the code flow. The usual way to implement games on event-driven platforms is to set up a timer which calls a per-frame callback function many times per second. I think hacks like this are why game programmers have a deep hatred for UI-centric platforms (and why I still like Windows despite its other shortcomings, because the recommended event-handling model for games on Windows (PeekMessage -> TranslateMessage -> DispatchMessage) actually lets you "own the game loop" in a very simple and elegant way through message polling).</p>
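<p>For illustration, the polling model can be sketched in a few lines of plain C++ (names invented, events reduced to strings): each frame first drains all pending events non-blockingly, then runs the frame, so the game code stays in control of the loop:</p>

```cpp
#include <queue>
#include <string>

// Hypothetical illustration of the PeekMessage-style polling model:
// the game loop asks for pending events instead of being called back.
struct EventQueue {
    std::queue<std::string> events;
    // non-blocking poll, the PeekMessage equivalent
    bool Poll(std::string& out) {
        if (events.empty()) {
            return false;
        }
        out = events.front();
        events.pop();
        return true;
    }
};

// runs a fixed number of frames; returns how many events were handled
int RunFrames(EventQueue& queue, int frames) {
    int handled = 0;
    for (int f = 0; f < frames; ++f) {
        std::string event;
        while (queue.Poll(event)) {  // drain all events for this frame
            ++handled;               // Translate/Dispatch would go here
        }
        // Update(); Render();      // the continuous, frame-driven part
    }
    return handled;
}
```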
<p>There are a few different approaches to either get a true continuous game loop, or at least to create the illusion of a continuous game loop on platforms where polling isn't possible, mainly depending on whether "true" pthreads-style multi-threading is supported or not.</p>
<p>In a Nebula3/emscripten application this isn't the case: the actual game loop and the rendering code run on the main thread. The reason is that emscripten's multithreading support is built on WebWorkers. pthreads emulation isn't possible in emscripten, since WebWorkers can't share memory with the main thread; furthermore, WebWorkers can't call into WebGL. This puts a lot of restrictions on our "game loop problem", and it required refactoring Nebula3's application model: in all previous ports there was always a way to somehow run a continuous game loop, mostly by moving the game loop into its own thread, but we don't have this option in emscripten (yet ... but hopefully one day, with more flexible WebWorkers).</p>
<p>Traditionally, a Nebula3 application used to go through a simple "Open -> Run -> Close -> Exit" sequence. An N3 main file looked like this for instance:</p>
<pre><code>#include "myapplication.h"
ImplementNebulaApplication();
void
NebulaMain(const Util::CommandLineArgs& args)
{
MyApplication app;
app.SetCommandLineArgs(args);
if (app.Open())
{
app.Run();
app.Close();
}
app.Exit();
}
</code></pre>
<p>Instead of a main() function, there's a NebulaMain() wrapper function and a macro called <em>ImplementNebulaApplication()</em>. These hide the fact that not all platforms have a standard main() (for a Windows application, one would typically use WinMain() for instance).</p>
<p>The actual system main function is hidden inside the <em>ImplementNebulaApplication()</em> macro, for a PC-like platform the macro code looks like this:</p>
<pre><code>int __cdecl main(int argc, const char** argv)
{
Util::CommandLineArgs args(argc, argv);
return NebulaMain(args);
}
</code></pre>
<p>Now back up to the NebulaMain() function's content: the Application::Open() method could take a while to execute (couple of seconds, worst case), and the Application::Run() will contain the "infinite" game loop, which only returns when the application should quit.</p>
<p>Since this wasn't a very good fit for the emscripten platform (because of this "infinite" loop inside the Run() method), first step was to make the app entry even more abstract to give the platform-specific code more wiggle room:</p>
<pre><code>#include "myapplication.h"
ImplementNebulaApplication();
void
NebulaMain(const Util::CommandLineArgs& args)
{
static MyApplication* app = new MyApplication();
app->SetCommandLineArgs(args);
app->StartMainLoop();
}
</code></pre>
<p>The most obvious change is that there's only a single StartMainLoop() method instead of the Open->Run->Close->Exit sequence. And at closer inspection some strange stuff is going on here: The application object is now created on the heap, the pointer to the object lives in the global scope, and the app object is never deleted. WTF?!?</p>
<p>To understand what's going on we need to dive a bit deeper into the emscripten system API.</p>
<p>The StartMainLoop function is actually only a one-liner on the emscripten platform:</p>
<pre><code>emscripten_set_main_loop(OnPhasedFrame, 0, 0);
</code></pre>
<p>This sets the per-frame callback (called OnPhasedFrame) which the browser runtime will call regularly, and we'll have to do <strong>everything</strong> inside this callback function. The first 0 argument is the intended callback frequency per second (e.g. 60); 0 has a special meaning: in this case emscripten uses the modern requestAnimationFrame mechanism to call our per-frame function (instead of the old-school setInterval or setTimeout way). The second argument is called simulateInfiniteLoop, and to understand what it does it is first necessary to understand what happens when it is <em>not</em> used:</p>
<p>The emscripten_set_main_loop() function will simply return, all the way up to main(), which will also return right after it has started! WTF indeed...</p>
<p>In a normal C program, returning from the main() function means that the program is shutting down of course. Local-scope objects will be destroyed before leaving main(), then global-scope objects (static initialisers).</p>
<p>In emscripten's case, a program which has called emscripten_set_main_loop() continues to run after main() has returned. This is a bit of a strange design decision, but makes for familiar looking code (e.g. hello_world.cpp is the same as on any other platform). Objects in the global scope <strong>will continue to exist</strong> in emscripten after main() returns, but objects in the local scope of main() will be destroyed, thus this strange way to create our application object, to prevent the app object from being destroyed after main() is left:</p>
<pre><code> static MyApplication* app = new MyApplication();
</code></pre>
<p>And now back to that <strong>simulate_infinite_loop</strong> argument: This is a new argument which was introduced after I started the Nebula3 emscripten port. Setting this argument to 1 will cause the emscripten_set_main_loop() function to not return to the caller, instead a Javascript exception will be thrown which essentially means that execution bails out of the C/C++ code without unwinding the (C/C++) stack, thus leaving local-scope objects of the main() function alive, everything after emscripten_set_main_loop() will never be called. So with this fix we could just as well write:</p>
<pre><code>void
NebulaMain(const Util::CommandLineArgs& args)
{
MyApplication app;
app.SetCommandLineArgs(args);
app.StartMainLoop();
}
</code></pre>
<p>Which looks a lot more friendly indeed.</p>
<p>So this basically covered emscripten's application startup process. We now have a per-frame function (called OnPhasedFrame) which will be called back at 60 fps, and we just need to cram everything the application has to do into these 1/60 sec time slices. This is fine for the actual game loop after everything has been loaded and initialised, but it can be a problem for stuff like loading a new level, which could take a couple of seconds. In a traditional game, the worst thing that could happen in this case is that the loading screen animation (if there is any) may stutter, but in a browser environment such pauses affect the entire browser tab (freezing, no scrolling, etc...), which makes a very bad first impression on the user.</p>
<p>So what to do? For Nebula3 I created a new Application base class called "PhasedApplication". Such a phased application goes through different life time phases (== states), such as:</p>
<pre><code>Initial -> app has just become alive
Preloading -> currently preloading data
Opening -> currently initializing
Running -> currently running the game loop
Closing -> currently shutting down
Quit -> shutting down has finished
</code></pre>
<p>Each of these phases (or states) has an associated per-frame callback method (OnInitial, OnPreloading, OnOpening, etc...). The central per-frame callback will simply call into one of those methods based on the current phase/state. Each phase method invocation must return quickly (the browser's responsiveness depends on this), and may be called many times until the next phase is activated. So instead of doing a lot of stuff in a single frame, we do many small things across many frames.</p>
<p>The best example to illustrate this is the OnOpening() method. Suppose we need to do a lot of initialisation work during the app's Opening phase: files need to be loaded, subsystems must be initialised, and so on. This may take a couple of seconds. But the rule is that we should ideally return within 1/60 sec, and we also don't have an independent render thread which could hide the main-thread freeze behind a smooth loading animation. So we need to do just a little bit of initialisation work, possibly update the rendering of the loading screen, and return to the browser runtime. But since we haven't switched to the next state yet, OnOpening() will be called back again, and we can do the next piece of initialisation work. This sounds awkward of course, and it is, but there's not a lot we can do about it.</p>
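<p>A toy model of this incremental OnOpening() pattern might look like this (class and member names are illustrative, not the actual Nebula3 PhasedApplication interface):</p>

```cpp
// Hypothetical sketch of a phased application: the central per-frame
// callback dispatches on the current phase, and each phase method does
// only a small slice of work before returning to the browser.
enum class Phase { Initial, Opening, Running };

class PhasedApp {
public:
    void OnFrame() {
        switch (phase) {
            case Phase::Initial: phase = Phase::Opening; break;
            case Phase::Opening: OnOpening(); break;
            case Phase::Running: ++frames; break;  // the actual game loop
        }
    }
    Phase CurrentPhase() const { return phase; }
    int FramesRun() const { return frames; }
private:
    void OnOpening() {
        // do ONE slice of initialisation per call; switch to Running
        // once everything is done (browser stays responsive inbetween)
        if (++initStepsDone >= initStepsTotal) {
            phase = Phase::Running;
        }
    }
    Phase phase = Phase::Initial;
    int initStepsDone = 0;
    static const int initStepsTotal = 5;  // e.g. 5 chunks of init work
    int frames = 0;
};
```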
<p>A new Javascript concept called <strong>generators</strong> could help clean up this mess: with generators it should be possible to chop a long sequence of actions into small slices while leaving the function context intact (essentially like a yield() function in a cooperative multithreading system), catapulting Javascript into the illustrious company of Windows 1.x and Classic MacOS. But enough with the ranting ;)</p>
<p>A somewhat cleaner method for long initialisation work is to start asynchronous actions through WebWorker jobs in the first call to OnOpening(), and during subsequent OnOpening() calls check whether all of those actions have finished, gather the results, and finally switch to the next state, which would be <em>Running</em>. In the worst case, initialisation code must literally be chopped into little slices running on the main thread.</p>
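<p>The start-async-then-poll variant can be sketched like this (again with invented names, and with the async jobs simulated instead of being real WebWorker requests):</p>

```cpp
#include <array>

// Hypothetical sketch: kick off async work in the first OnOpening()
// call, then poll for completion in the following per-frame calls.
class AsyncInit {
public:
    // called once per frame from OnOpening(); returns true when all
    // async jobs have finished and the app may switch to Running
    bool PollOpening() {
        if (!started) {
            started = true;   // first call: start the async jobs here
            return false;     // ...and return to the browser right away
        }
        // later calls: check the completion flags; in the real thing
        // these would be set by async completion callbacks, here we
        // simulate one job finishing per frame
        for (bool& done : jobsDone) {
            if (!done) {
                done = true;
                return false;
            }
        }
        return true;          // everything loaded
    }
private:
    bool started = false;
    std::array<bool, 3> jobsDone{};  // e.g. textures, meshes, shaders
};
```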
<p>So that's it for this blog post. Originally I wanted to compare emscripten's and PNaCl's startup process, but this would be way too much text for a single post, so next up will very likely be a similar walk-through of the PNaCl application start, and after that the next big topic: how to handle asset loading.</p>
<blockquote>
<p>Written with <a href="http://benweet.github.io/stackedit/">StackEdit</a>.</p>
</blockquote>Unknownnoreply@blogger.comtag:blogger.com,1999:blog-2948438400037317662.post-87502643851474524042013-08-26T22:32:00.001+01:002013-08-26T22:32:24.508+01:00emscripten and PNaCl: Build Systems<p>I recently ported Nebula3 to Google's PNaCl. Main motivation was that I wanted to see how it compares to asm.js both for performance and "ease of use". This was basically a drive-by port, I didn't want to put too much effort into it. Thankfully I had old NaCl code lying around which I could reuse and after 2 or 3 afternoons (and some WTF-moments) I had a pretty clean port running which I'm planning to keep updated into the foreseeable future.</p>
<p>The big news about PNaCl is that deployment no longer has to go through the Chrome Web Store, instead it is now finally possible to host PNaCl applications from any URL.</p>
<p>You can check out the Nebula3 PNaCl demos here: <a href="http://www.flohofwoe.net/demos.html">http://www.flohofwoe.net/demos.html</a>. Just make sure you're running the latest Google Chrome Canary, and if an error pops up saying that PNaCl isn't enabled, just restart Chrome and wait a little bit. The first start can take up to one minute, since PNaCl support is installed on demand, which is a multi-MByte download.</p>
<p>Over the next few weeks I'm intending to write up a little series of blog posts comparing the PNaCl and emscripten Nebula3 ports. From a coder's perspective, the two systems are actually fairly close when seen from high above.</p>
<p>As a "pragmatic programmer", I don't really care about the political side. Both asm.js and PNaCl had to take a lot of flak from web purists. The only thing that counts to me is that both technologies provide a seamless software distribution channel directly from the coder to the user. No app shops, gate-keepers, code-signing-certificates or approval processes inbetween.</p>
<h4 class="wmd-title" id="the-build-system">The Build System</h4>
<p>First step is of course to get the SDKs. Both emscripten and PNaCl offer a GCC-style cross-compiling toolchain based on Clang-LLVM. Quick disclaimer: I'm running on OSX, haven't looked at the Windows side of things yet. </p>
<p>The <strong>emscripten SDK</strong> is simply installed and updated through a github repository. There's a stable <strong>master</strong> branch, and a bleeding-edge <strong>incoming</strong> branch. emscripten requires a couple of external tools, most notably Clang-LLVM, python and node.js. Even though clang is the standard compiler on OSX, I installed a separate version because emscripten required a newer version than the one that ships with OSX 10.7. Paths to external tools must be provided through a <strong>.emscripten</strong> config file in your home dir.</p>
<p>The NaCl SDK is a normal download-archive which should be unzipped to a nacl_sdk directory in your home directory. This download only contains a script file called "naclsdk" which takes care of downloading and updating the actual SDK files in the future. The NaCl SDK contains versioned bundles, each of which is actually a complete SDK in itself, with tools, headers, libraries and examples. This is the same philosophy as the DirectX SDKs. You pick a version to work with and decide yourself when to switch to a newer version, this guarantees you a stable API, and gives the dev team the freedom to change APIs in new versions without breaking code compiled against older versions.</p>
<p>One challenge about the NaCl SDK is finding the right compiler tools and runtime libs, since there are so many choices. The "classic" CPU-specific NaCl had different toolchains for the ARM and Intel CPU architectures, and two different C runtime libs to choose from: newlib or glibc.</p>
<p>PNaCl is much simpler though: there are no longer different target CPU architectures, since PNaCl executables are essentially LLVM bitcode, and the only available C runtime lib is newlib (which is the better choice anyway, since it is much slimmer than glibc).</p>
<p>In Nebula3 I'm using <strong>cmake</strong> to generate build files for different target platforms and build systems / IDEs. For each platform, you write a so-called <strong>toolchain file</strong> which contains the paths to the cross-compiling tools, search paths for headers and libraries, and compiler/linker settings.</p>
<p>Writing such a toolchain file can involve a bit of guesswork, but there are examples floating around the net; emscripten also comes with sample cmake toolchain files which might be helpful as a starting point.</p>
<p>Here are a couple of tips which might save you some trouble:</p>
<ul>
<li>don't set "ld" as the linker tool; in both toolchains the normal compiler tool also serves as linker (in emscripten this is <strong>emcc</strong>, in PNaCl use <strong>pnacl-clang++</strong>)</li>
<li>PNaCl requires an additional post-build step after linking, called pnacl-finalize; cmake has the <strong>add_custom_command</strong> macro for this</li>
</ul>
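<p>Following these tips, a minimal emscripten toolchain file could look roughly like this. This is only a sketch, not a drop-in file: the paths, the EMSCRIPTEN_ROOT variable and the AsmJS flag set are assumptions and depend on your SDK installation:</p>

```cmake
# hypothetical emscripten.toolchain.cmake sketch - adjust paths to your setup
set(CMAKE_SYSTEM_NAME Generic)

set(EMSCRIPTEN_ROOT "$ENV{HOME}/emscripten" CACHE PATH "emscripten SDK root")

# the compiler driver also acts as the linker (tip 1: don't set "ld")
set(CMAKE_C_COMPILER   "${EMSCRIPTEN_ROOT}/emcc")
set(CMAKE_CXX_COMPILER "${EMSCRIPTEN_ROOT}/em++")
set(CMAKE_AR           "${EMSCRIPTEN_ROOT}/emar" CACHE FILEPATH "archiver")
set(CMAKE_RANLIB       "${EMSCRIPTEN_ROOT}/emranlib" CACHE FILEPATH "ranlib")

# compiler/linker settings for the custom AsmJS build config
set(CMAKE_CXX_FLAGS_ASMJS "-O2")
set(CMAKE_EXE_LINKER_FLAGS_ASMJS "-s ASM_JS=1")
```

A PNaCl toolchain file would look similar, with pnacl-clang++ as compiler/linker and the pnacl-finalize step added via add_custom_command.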
<p>To properly separate the different build files I have a directory structure like this:</p>
<pre><code>nebula3/
    code/
    cmake/
        emscripten_asmjs/
        emscripten_debug/
        pnacl_release/
        pnacl_debug/
</code></pre>
<p>All the source code lives under /code, and all the build files are generated under cmake/ with one directory per target platform and build configuration.</p>
<p>To actually generate the build files, I have a couple of shell scripts under /code which invoke cmake like this:</p>
<pre><code>cd ../cmake/emscripten_asmjs
cmake -G "Eclipse CDT4 - Ninja" -DCMAKE_BUILD_TYPE="AsmJS" -DNEBULA_PLATFORM=EMSCRIPTEN -DCMAKE_TOOLCHAIN_FILE="../../bin/emscripten.toolchain.cmake" ../../code
</code></pre>
<p>The <strong>-G</strong> option is the cmake "generator"; we're telling cmake here that we want Eclipse project files using the ninja build tool (ninja is a more modern make alternative). <strong>-DCMAKE_BUILD_TYPE</strong> sets the AsmJS build config (cmake lets us define any number of custom configs, commonly just Release and Debug, but for emscripten I have defined an extra AsmJS config). <strong>-DNEBULA_PLATFORM=EMSCRIPTEN</strong> is one of our own custom symbol definitions; this simply tells our cmake files that we're building for the emscripten target platform (actually this is redundant, a better place for this definition would be the toolchain file). Next we tell cmake which toolchain file to use, and finally where the source code is located (or more specifically: where to find the root CMakeLists.txt file - CMakeLists.txt files tell cmake what targets to build, and from what sources).</p>
<p>When cmake has run, we could import the generated project into Eclipse, or we can just run ninja from the command line:</p>
<p><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi6RtGJIQGn6TMTficyviplfAyy7Wdekqpf3UWy7J6r_9cI8AL7ANuXmumkZ4OHrN0pC21rbj4CXlAroCW-knM7BtABdIJ1bFyDLWhyphenhyphenXFogingHJ-N7zlh8D4HlbqoWnmBmvA3OY0Y0lfT7/s0/Screen+Shot+2013-08-26+at+10.34.10+PM.png" alt="ninja invocation" title="ninja.png"></p>
<p>Writing a proper cmake based build environment can be a lot of work, but it is definitely worth it. Managing a multi-platform build environment across Linux, OSX and Windows and probably several game consoles, spanning different IDEs like Visual Studio, Xcode and Eclipse would be a nightmare without a meta-build-tool like cmake.</p>
<h3 class="wmd-title" id="deployment">Deployment</h3>
<p>Big jump here, but no worries, I'll deal with all the in-between stuff in the following blog posts.</p>
<p>The common thing between emscripten and PNaCl when deploying is that the generated files are embedded into a web page, and thus can be easily integrated into existing web site build- and deployment-processes.</p>
<p>The details are a little bit different between the two though:</p>
<p>An emscripten "executable" is either a .js file or a complete HTML page (the so-called shell page) which embeds the generated Javascript code. The emscripten linker looks at the output file extension to decide whether it should generate a .js or .html file. Emscripten comes with a default html shell file which should be used as a starting point for a customised web page.</p>
<p>Integrating emscripten generated code into a web page is just the same as integrating any piece of complex Javascript code. Since emscripten-generated code is just Javascript, it is also very easy to interact with the rest of the page through direct JS function calls.</p>
<p>PNaCl on the other hand integrates like a plugin into the HTML page using the <strong>embed</strong> element:</p>
<pre><code><embed src="dragons.nmf" class="pnacl" id="pnacl_module" name="pnacl_module" width="800" height="452" type="application/x-pnacl"/>
</code></pre>
<p>Instead of the .pexe file, a .nmf <strong>manifest</strong> file is given to the embed element which contains the name of the .pexe file (this manifest file used to look more interesting in classic NaCl since it contained one entry for each target cpu architecture, but for PNaCl there's only one useful piece of information):</p>
<pre><code>{
"program": {
"portable": {
"pnacl-translate": {
"url": "dragons.pexe"
}
}
}
}
</code></pre>
<p>Finally, the <strong>type="application/x-pnacl"</strong> attribute is important for Chrome to recognise the embed element as a PNaCl application.</p>
<p>Interaction between a PNaCl application and the surrounding web page works through the Javascript messaging system. To get events from the PNaCl application, just add event listeners to the embed element:</p>
<pre><code><script type="text/javascript">
// ...
var naclModule = document.getElementById("pnacl_module");
naclModule.addEventListener('loadstart', handleLoadStart, true);
naclModule.addEventListener('progress', handleProgress, true);
naclModule.addEventListener('load', handleLoad, true);
naclModule.addEventListener('error', handleError, true);
naclModule.addEventListener('crash', handleCrash, true);
naclModule.addEventListener('message', handleMessage, true);
// ...
</script>
</code></pre>
<p>The other way around works as well, by sending messages to the PNaCl app through postMessage.</p><h3 class="wmd-title" id="the-end">The End</h3>
<p>Ok, that's it. Next up I'll go through the changes to the Nebula3 Application Model which were necessary for the web platforms!</p>
<blockquote>
<p>Written with <a href="http://benweet.github.io/stackedit/">StackEdit</a>.</p>
</blockquote>Unknownnoreply@blogger.comtag:blogger.com,1999:blog-2948438400037317662.post-14362648375385757502013-07-06T21:00:00.001+01:002013-07-06T21:00:51.714+01:00Entity-Component-System Revisited<p>This old blog post about the <a href="http://flohofwoe.blogspot.de/2007/11/nebula3s-application-layer-provides.html">Nebula3 Application Layer</a> is the third-most-popular post on my blog, very likely because it was linked from <a href="http://stackoverflow.com/questions/1189236/data-structures-for-message-passing-within-a-program">Stack Overflow</a>. I always wanted to write a follow-up to this post, because if I designed such a system again, it would look quite different today.</p>
<p>First a quick recap of the original system:</p>
<ul>
<li>the original system consists of the following classes:
<ul><li><strong>Entity:</strong> a container for Properties and Attributes, can receive Messages which are distributed to its Properties</li>
<li><strong>Property:</strong> attached to an Entity, implements some part of the entity's "game logic", receives and processes messages</li>
<li><strong>Message:</strong> a small object which is sent to an Entity and distributed to Properties which may handle them</li>
<li><strong>Attribute:</strong> key/value pairs attached to entities</li>
<li><strong>Manager:</strong> singletons which implement global game logic</li>
<li>the only pre-defined Manager is the EntityManager which is a container for Entities, and allows to query for entities</li>
<li>Entities and Properties have several per-frame callbacks and are called back by the EntityManager</li></ul></li>
<li>the motivation behind this system:
<ul><li>to have a simple, extensible high-level framework for <strong>implementing game-play logic</strong></li>
<li>fix extension-through-inheritance problems through <strong>composition</strong></li></ul></li>
<li>and the problems of the original system:
<ul><li><strong>poor spatial locality:</strong> Entities, Properties and Messages are isolated heap objects and can be spread all over the address space in the worst case</li>
<li><strong>high cost for creation and destruction:</strong> all objects are dynamically allocated, this is especially a problem for Messages, there may be thousands of Messages created and destroyed per frame</li>
<li><strong>high cost for setting/getting Attributes:</strong> setting or getting an attribute value involves an O(log2 n) lookup</li>
<li><strong>high overhead for on-frame callbacks</strong>: the EntityManager calls several callbacks every frame on each entity, with many entities the call-overhead is non-trivial</li>
<li><strong>reliance on virtual methods</strong>: almost all public methods in properties are virtual, because the message handler and callback methods are implemented in a Property base class, with specialised properties as subclasses</li></ul></li>
</ul>
<p>In the old single-player Drakensang games we had up to two-thousand game entities in some bigger maps, and we ran into real performance problems because the entity system is so heavy-weight.</p>
<p>So here's how I would implement a similar system today, keep in mind that this is just a "<a href="http://en.wikipedia.org/wiki/Gedankenexperiment">Gedankenexperiment</a>", and I will make up some stuff while I type (but most of it has been lingering in the back of my head for quite a while now). </p>
<p>The main goals are to improve performance by making the system less dynamic, reduce memory fragmentation and reduce message-passing and object creation overhead.</p>
<p>Here we go:</p>
<h4 id="1-move-all-the-interesting-code-into-separate-subsystems">1. Move all the interesting code into separate subsystems</h4>
<p>In the original entity system, Managers and Properties would often implement actual game logic, and could become big, complicated and unwieldy.</p>
<p>The new entity system would only be minimal glue code between (ideally autonomous) subsystems, each with a Facade singleton as its main public interface. Such subsystems could be: rendering, AI, physics, audio, and also <strong>anything else that makes up the game</strong>. The last point is important: even when already using such autonomous subsystems for low-level stuff like rendering or audio, it is tempting to write the actual game logic "along the way" inside Properties without separating it into additional "game logic subsystems", which is guaranteed to soon end in an unmaintainable mess.</p>
<p>Ideally, each of the autonomous subsystems can live (and be tested) on its own, and will not interact with other subsystems (the physics world must not know about the rendering world or the audio world and so on).</p>
<p>One of the main jobs of the entity-component-system is to control and coordinate the data flow between those autonomous subsystems; it glues the subsystems together (e.g. getting the desired motion from the AI/navigation system into the physics system, and getting position updates from the physics system into the rendering system).</p>
<p>The other job is to provide different types of game objects (for instance different unit types in a strategy games) by combining small, reusable Component objects which implement different aspects/behaviours of the game logic.</p>
<p>The important thing to keep in mind is that all the classes of the new entity system will only provide a slim layer of glue between subsystems which contain all the meaty stuff.</p>
<h4 id="whats-in-the-new-entity-system">What's in the new entity system</h4>
<p>Properties will now be called Components, but their role will be the same. Managers and Attributes will go away (reasons are detailed below). Entities and Messages will keep their names and roles. </p>
<h4 id="fixing-the-spatial-locality-and-cost-of-creation">Fixing the Spatial Locality and Cost of Creation</h4>
<p>Entities and Components would be created from pre-allocated object pools. Live Entities and Components would ideally be located next to each other without big memory holes in between. As the public handle to an Entity I would probably use an EntityID instead of a (smart) pointer; the EntityID would be a 32-bit integer, with some bits used as an index into the entity pool, and some bits as a unique wrap-around counter to prevent an old Id from pointing to a recycled object in the pool.</p>
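<p>Such a handle could be sketched like this. The 20/12 bit split and all names here are made up for illustration, not the actual Nebula3 layout:</p>

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical EntityId layout: low 20 bits index into the entity pool,
// high 12 bits hold a wrap-around unique counter.
typedef uint32_t EntityId;

enum {
    EntityIndexBits = 20,
    EntityIndexMask = (1 << EntityIndexBits) - 1    // 0x000FFFFF
};

inline EntityId MakeEntityId(uint32_t poolIndex, uint32_t uniqueCount) {
    assert(poolIndex <= (uint32_t) EntityIndexMask);
    return (uniqueCount << EntityIndexBits) | poolIndex;
}
inline uint32_t EntityPoolIndex(EntityId id)  { return id & EntityIndexMask; }
inline uint32_t EntityUniqueCount(EntityId id) { return id >> EntityIndexBits; }
```

Resolving an id then means: look up slot EntityPoolIndex(id) in the pool, and check that the unique counter stored in that slot still matches EntityUniqueCount(id); if it doesn't, the slot has been recycled and the id is stale.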
<h4 id="entities-and-components">Entities and Components</h4>
<p>An Entity would be a template class which must be partly implemented by the game programmer, tailored to his project. The max number of Components the Entity can hold is a template parameter. There's a private C array of raw pointers to Components contained inside the Entity class, and programmer-provided template methods to gain safe access to those Component objects.</p>
<p>An example: let's say the components-access template method would be called Component(), then invoking a method "SetTransform()" on a component "Location" would look like this:</p>
<pre><code>entity->Component<Location>()->SetTransform(m);
</code></pre>
<p>Hmm, this looks mighty ugly though... The advantage is that the Component<> method will resolve to a simple inlined pointer indirection, which is as cheap as it gets. But I will have to think of some nicer looking code...</p>
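<p>A minimal sketch of how such an Entity template could work. All names, and the compile-time Slot index mechanism used to map a component type to an array element, are assumptions for illustration, not the actual Nebula3 code:</p>

```cpp
#include <cassert>

class ComponentBase {
public:
    virtual ~ComponentBase() {}
};

template<int MAXCOMPONENTS> class Entity {
public:
    Entity() {
        for (int i = 0; i < MAXCOMPONENTS; i++) {
            this->comps[i] = 0;
        }
    }
    // register a component under its compile-time slot index
    template<class TYPE> void AddComponent(TYPE* c) {
        this->comps[TYPE::Slot] = c;
    }
    // resolves to a simple inlined pointer indirection
    template<class TYPE> TYPE* Component() const {
        assert(this->comps[TYPE::Slot] != 0);
        return static_cast<TYPE*>(this->comps[TYPE::Slot]);
    }
private:
    ComponentBase* comps[MAXCOMPONENTS];
};

// example component with a made-up compile-time slot index
class Location : public ComponentBase {
public:
    enum { Slot = 0 };
    Location() : transform(0.0f) {}
    void SetTransform(float m) { this->transform = m; }
    float GetTransform() const { return this->transform; }
private:
    float transform;    // stand-in for a real matrix
};
```

The Component&lt;Location&gt;() call compiles down to an array access plus a static_cast, both of which the compiler can inline away completely.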
<h4 id="attributes">Attributes</h4>
<p>Attributes will very likely go away completely because the cost for setting/getting is too high (this involved a binary search). Instead, entity state will be exposed through simple inline getter methods in Component classes. There are no setter methods, because direct, unchecked manipulation of internal entity state by an "outsider" would be too dangerous. Manipulating an entity is exclusively done by sending messages to the entity.</p>
<p>There must still be a more dynamic, generalised way to initialise and manipulate an entity (this was a nice side-effect of the general attribute-system), for instance to implement persistence or communicate with remote applications (like a level editor). For this, some general serialisation mechanism to and from a simple binary stream must be implemented.</p>
<h4 id="the-entity-registry">The Entity Registry</h4>
<p>This would be a singleton used as factory and container of entities (basically the facade of the entity system). It would allow creation of entities, resolve an EntityID into a pointer, probably look up entities by name (if having human-readable entity names makes sense at all), and send messages to entities. This would be similar to the old EntityManager, but it would not call any per-frame methods on entities (it would be desirable if the new entity system didn't need any type of per-frame tick at all).</p>
<h4 id="components-and-messages">Components and Messages</h4>
<p>Sending a message to an entity should not involve creating a message object; instead, a message is just a simple, short-lived stream of plain-old-data bytes in some hidden memory buffer. There will be a unique message type identifier, which is a simple 32-bit integer value (or maybe an enum) at the front of the byte stream.</p>
<p>Messages are processed by Component objects, which can subscribe to specific message types at the central EntityRegistry by associating a message type with a handler method:</p>
<pre><code>entityRegistry->Subscribe(msgType, componentType, methodPtr);
</code></pre>
<p>A message is sent to one or more entities through the central EntityRegistry by calling one of several "PushMsg" template methods which accept a variable number of arguments. Each combination of arg types will resolve to a template specialisation under the hood. The advantage, again, is that none of this involves expensive "dynamic" code; each specific message signature will resolve to a piece of code which is very likely inlined and just consists of writing values to memory:</p>
<pre><code>entityRegistry->PushMsg(entityId, msgType, arg0, ...);
</code></pre>
<p>This will write the args to an internal memory area (with proper alignment), and call the handler methods of the subscribers, which will be provided with some sort of pointer to the start of the arguments, read/decode the arguments and perform some action with them. The disadvantage here is that there's no type-safety for the message arguments. If the caller and handlers don't agree about the order and types of the arguments, bad things will happen at run time, so it might still be better to use simple message classes instead of multiple typed arguments:</p>
<pre><code>MyMsg msg(x, y, z);
entityRegistry->PushMsg(entityId, msg);
</code></pre>
<p>This would have the overhead of an extra object created on the stack (still better than on the heap), and would involve defining dozens or hundreds of message classes which would only consist of setters and getters; this should be a job for a code generator (we have something similar already called NIDL files, which are used to generate C++ message classes from a simple XML description). The advantage is type-safety and automatic agreement between sender and handler about the message arguments, plus the message class constructor can set up default argument values.</p>
<p>The default PushMsg() method will probably call the subscribers immediately. It might be desirable to also have deferred message handling, where the sender defines a time in the future when the message should be handled. It might also be possible to use this mechanism to send messages between remote objects across threads, processes and physical machines, but this might go a bit too far.</p>
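<p>To make the "args as POD bytes in a scratch buffer" idea more concrete, here is a greatly simplified sketch. All names are made up, subscribers are reduced to plain function pointers instead of component methods, and STL containers are used purely for brevity:</p>

```cpp
#include <cassert>
#include <cstring>
#include <map>
#include <vector>

typedef unsigned int MsgType;
typedef unsigned int EntityId;
typedef void (*MsgHandler)(EntityId entity, const void* args);

class EntityRegistry {
public:
    // associate a message type with a handler (simplified to a function ptr)
    void Subscribe(MsgType msgType, MsgHandler handler) {
        this->subscribers[msgType].push_back(handler);
    }
    // one overload per arg signature; a real system would generate these
    template<class ARG0, class ARG1>
    void PushMsg(EntityId entity, MsgType msgType, ARG0 a0, ARG1 a1) {
        // write the args tightly packed into a scratch buffer...
        unsigned char* ptr = this->scratch;
        std::memcpy(ptr, &a0, sizeof(a0)); ptr += sizeof(a0);
        std::memcpy(ptr, &a1, sizeof(a1));
        // ...and call the subscribed handlers immediately
        std::vector<MsgHandler>& handlers = this->subscribers[msgType];
        for (size_t i = 0; i < handlers.size(); i++) {
            handlers[i](entity, this->scratch);
        }
    }
private:
    std::map<MsgType, std::vector<MsgHandler> > subscribers;
    unsigned char scratch[256];
};

// example handler: decode two packed floats (sender and handler must agree!)
static float g_lastSum = 0.0f;
static void HandleMove(EntityId entity, const void* args) {
    float x, y;
    std::memcpy(&x, args, sizeof(float));
    std::memcpy(&y, static_cast<const unsigned char*>(args) + sizeof(float),
                sizeof(float));
    g_lastSum = x + y;
}
```

Note how the lack of type-safety shows up right in the handler: it simply assumes two floats at fixed offsets, which is exactly the caller/handler agreement problem described above.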
<h4 id="what-about-the-managers">What about the Managers?</h4>
<p>Managers don't really have a place in the new entity-system. Their role is taken over by the Facade singletons of the autonomous subsystems.</p>
<h4 id="conclusion">Conclusion</h4>
<p>I think the original ideas behind the Nebula3 Application Layer as a flexible Entity-Component-System still make a lot of sense for a high level game framework, but today I look at the original implementation as too "heavy-weight" both in design and implementation. If I were to rewrite the system (and I'm tempted, but other stuff has higher priority) I would start as described here. What the end-result would look like is on another page, I tend to restart such systems from scratch several times if the code "doesn't look right" :)</p>
<blockquote>
<p>Written with <a href="http://benweet.github.io/stackedit/">StackEdit</a>.</p>
</blockquote>Unknownnoreply@blogger.comtag:blogger.com,1999:blog-2948438400037317662.post-11987048016345464552013-06-21T22:34:00.001+01:002013-06-23T12:46:31.016+01:00Sane C++<strong>TL;DR</strong>: An attempt to outline the 'good parts' of C++ from my experience of porting Nebula3 to various platforms over the years. Some of it controversial.<br />
<br />
<b>Update: </b><span style="font-weight: normal;">some explanation why STL and C++11 are currently "forbidden", see below!</span>
<br />
<div>
<span style="font-weight: normal;"><br /></span></div>
<b>C++</b><br />
<br />
...is relatively famous for how easy it is to shoot yourself in the foot in many interesting ways. The types of bugs which are simply impossible in other languages are legion.<br />
So then, why is C++ so damn popular in game development? One of the most important reasons (IMHO) is that C++ allows you to write very high-level and very low-level code. If needed, you can have full control over the memory layout, when and how dynamic memory is allocated and freed, and how exactly memory is accessed. At the same time you can write very clean and high-level code with the right framework and not care about memory management at all.<br />
Especially the significance of low-level programming, e.g. controlling the exact memory layout of your data, is often ignored by other, higher-level languages, even though it can have a dramatic effect on performance.<br />
One of the most common C++ newbie errors is to tackle a big software project without a proper high-level "toolbox". C++ doesn't come with a luxurious standard framework like all those fancy-pancy modern languages. <br />
And with only hello_world.cpp under their belt newbies quickly end up with this typical mess of object ownership problems, spaghetti-inheritance, seg-faults, memory leaks and lots of redundant code all over the place after just a few ten-thousand lines of code.<br />
On the other hand, it is incredibly easy to write really slow code in a high-level environment since you don't really know (or need to care) what's going on under all those layers of convenience. <br />
The most important rule when diving into C++ is: Know when to write high-level and when to write low-level code, these are completely different beasts!<br />
So what's the difference between high-level and low-level C++ code? I think there's no clear-cut separation line, but a good rule of thumb is: if it needs to run a few thousand times per frame, it better be really well optimised low-level code!<br />
<ul>
<li>If you look at a typical rendering pipeline, there's this typical cascade where every stage in the pipeline is executed at least an order of magnitude more often than the previous one: outer-most there's stuff that happens only once per frame, next code is executed once per graphics object, then once per bone/joint, then per vertex, and finally per pixel. The realm of low-level code starts somewhere between per-object and per-bone (IMHO).</li>
<li>Typical high-level code to me is "game play logic". This is also where thinking object-oriented still makes the most sense (as opposed to a more data-oriented approach). You have a couple of "game objects" which need to interact with each other in fairly complex ways. On this level you don't want to think about object ownership or memory layout, and high-level concepts like events, delegates, properties etc... start to make sense. Shit starts to hit the fan when you have thousands of such game objects.</li>
<li>It is of course desirable to get the performance advantages of low-level code combined with the simplicity and convenience of high-level code. This is basically the holy grail of games programming. Hiding complex or complicated code under simple interfaces is a good start.</li>
</ul>
Ok, so before I drift completely into the metaphysical, here's a simple check-list:<br />
<h4 id="forbidden-c">
Forbidden C++:</h4>
This stuff is completely forbidden in our coding-style:<br />
<ul>
<li>exceptions</li>
<li>RTTI</li>
<li>STL</li>
<li>multiple inheritance</li>
<li>iostream</li>
<li>C++11</li>
</ul>
That's right, we're not using C++ exceptions, RTTI, multiple inheritance or the STL. C++11 is pretty cool, but still too fresh. Most of these restrictions will make your multiplatform-life a lot easier (and not much of importance is lost IMHO). <br />
<br />
<b>Update:</b> I should have explained why the STL and C++11 are on this list. First the STL: historically, the STL came with a lot of problems because quality differed a lot between compilers, porting to non-PC platforms was difficult if your code depended on the STL, and I am reluctant to pull more complex dependencies into the engine (like boost, for example). Today's STL implementations are much better, so on most platforms this is probably no longer an issue.<br />
<br />
Personally, I think the STL is an ugly library, <i>at least</i> the container classes. You have to admire its orthogonality and flexibility, but in reality one project only ever needs 3 or 4 specialisations. What we did was write a handful of container classes (Array, Dictionary, Queue, Stack, List) in the spirit of C#'s container classes (those are probably not as flexible as STL containers, but they do look nicer, and the generated code should be the same in most cases). Beautiful-looking source code is important, I think. This may all change with C++11 though. C++11 is extremely cool, but I think it is still too early to jump on it if we need to cover a lot of platforms. But C++11 together with the STL is much more powerful than either of the two alone, so I will very likely revert my stance on the STL once we switch to C++11.<br />
<br />
But I think this switch should be done throughout the entire engine (starting at the core with the new move semantics which are really useful for containers, through the new threading support, lambdas, function objects and so on), so switching to C++11 will involve a major rewrite of Nebula3, maybe even justify a major version number bump. I think it doesn't make sense to sprinkle bits and pieces of C++11 and the STL here and there into the code.
<h4 id="tolerated-c">
Tolerated C++:</h4>
Use with care, don't go crazy:<br />
<ul>
<li>templates</li>
<li>operator overloading</li>
<li>new/delete</li>
<li>virtual methods</li>
</ul>
<strong>Templates</strong> are very powerful, they can make your code both more readable, AND faster because more type information is known at compile time. But you really need to keep an eye on the generated code size. Don't nest them too deeply, and keep it simple.<br />
<strong>Operator overloading</strong> is restricted to very few places (containers and items in containers). We're NOT having operator overloading in our math library. dot(vec,vec) is much more readable than vec*vec.<br />
<strong>Not using new/delete</strong> in C++ code sounds a bit crazy, I know. But most of the time where you need to create an object on the heap you'll also want to hand its pointer somewhere else, which quickly introduces ownership problems. That's why we're using smart pointers to heap objects which hide the delete call. And since a new without its delete looks a bit silly, we're also hiding the new behind a static Create() method. It's better to avoid heap objects altogether though, especially in low-level code.<br />
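The smart-pointer-plus-Create() pattern can be sketched like this. This is a simplified illustration of the idea, not the actual Nebula3 Ptr/RefCounted code (it omits assignment from raw pointers, comparison operators and so on):<br />

```cpp
#include <cassert>

// refcounted base class: delete is hidden inside Release()
class RefCounted {
public:
    RefCounted() : refCount(0) {}
    virtual ~RefCounted() {}
    void AddRef() { ++this->refCount; }
    void Release() { if (0 == --this->refCount) delete this; }
    int GetRefCount() const { return this->refCount; }
private:
    int refCount;
};

// intrusive smart pointer which manages the refcount
template<class TYPE> class Ptr {
public:
    Ptr() : ptr(0) {}
    Ptr(TYPE* p) : ptr(p) { if (this->ptr) this->ptr->AddRef(); }
    Ptr(const Ptr& rhs) : ptr(rhs.ptr) { if (this->ptr) this->ptr->AddRef(); }
    ~Ptr() { if (this->ptr) this->ptr->Release(); }
    Ptr& operator=(const Ptr& rhs) {
        if (rhs.ptr) rhs.ptr->AddRef();   // addref first: safe for self-assignment
        if (this->ptr) this->ptr->Release();
        this->ptr = rhs.ptr;
        return *this;
    }
    TYPE* operator->() const { assert(this->ptr); return this->ptr; }
private:
    TYPE* ptr;
};

// the 'new' is hidden behind a static Create() method
class MyObject : public RefCounted {
public:
    static Ptr<MyObject> Create() { return Ptr<MyObject>(new MyObject()); }
    int GetValue() const { return 42; }
};
```

Client code only ever sees Ptr&lt;MyObject&gt; obj = MyObject::Create(); the object is deleted automatically when the last Ptr goes out of scope, so ownership problems largely disappear.<br />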
<strong>Virtual methods</strong> are important of course, BUT: Just spend a second to think about whether a method really must be virtual (or more importantly: do you really need run-time polymorphism, or is compile-time polymorphism enough?). The more "static" your code is, the more optimisation options the compiler has.<br />
<h4 id="forbidden-c-1">
Forbidden C:</h4>
Some unusual stuff here as well:<br />
<ul>
<li>all CRT functions like fopen() or strcmp() are forbidden, except the math.h functions</li>
<li>directly calling malloc()/free() is forbidden</li>
</ul>
Most of the CRT functions are straight-out terrible (strpbrk, strtok, ...) and/or dangerous (strcpy), so we're wrapping them all away and/or using better platform-specific functions under the hood (this can also reduce executable size, which is always good).<br />
Overriding malloc/free with central wrapper functions is really useful once you need to do memory-debugging and -profiling, also makes it easier to try out different memory allocator libs.<br />
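A sketch of what such central wrapper functions could look like; the names are made up, and a simple live-allocation counter stands in for real memory-debugging and -profiling hooks:<br />

```cpp
#include <cassert>
#include <cstdlib>

// hypothetical central allocation wrappers: all engine code calls
// Memory::Alloc / Memory::Free instead of malloc/free, so debug hooks
// or a different allocator lib can be dropped in at one single place
namespace Memory {

static int AllocCount = 0;   // number of live allocations, for leak checks

inline void* Alloc(size_t numBytes) {
    void* ptr = std::malloc(numBytes);
    assert(ptr != 0);
    ++AllocCount;
    return ptr;
}

inline void Free(void* ptr) {
    assert(ptr != 0);
    --AllocCount;
    std::free(ptr);
}

} // namespace Memory
```

At shutdown (or per frame in debug builds), a non-zero AllocCount immediately points at a leak, and swapping std::malloc for another allocator lib touches only these two functions.<br />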
<h4 id="tolerated-c-1">
Tolerated C:</h4>
Some "dangerous" stuff is only allowed in performance-critical low-level code:<br />
<ul>
<li>raw pointers and pointer arithmetics</li>
<li>raw C arrays </li>
<li>raw memory buffers</li>
</ul>
These are all recipes for disaster in the hands of an inexperienced programmer (or an experienced programmer who needs to juggle too many things in his head). Instead of pointers, use smart pointers to refcounted objects (see above), or indices into containers. Instead of raw arrays, use containers. Never directly allocate and access memory buffers in high-level code.<br />
All of these "dangerous techniques" are essential for really performance-critical low-level code though, but this is only at a handful places in the code, and when the really mysterious kind of crashes happen, at least you know where to look.<br />
<h4 id="the-end">
The End</h4>
One last point: our code is riddled with asserts which are also enabled in release mode (this hardly makes a performance difference, but the uncompressed executable size is up to 20% larger because of the expression strings; thankfully those strings compress very well). <br />
The essential, must-have assert checks are for invalid smart pointer accesses (null pointers), boundary checks in container classes and checking for valid method parameters.<br />
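Such a release-mode assert can be sketched as a macro like this (illustrative only, the name is made up; as described below, the real checks would additionally capture a call-stack and method signature):<br />

```cpp
#include <cstdio>
#include <cstdlib>

// assert macro which stays active in release builds; the stringified
// expression (#exp) is what bloats the uncompressed executable, the
// runtime cost of the check itself is negligible
#define game_assert(exp) \
    do { \
        if (!(exp)) { \
            std::printf("assertion failed: (%s), file %s, line %d\n", \
                        #exp, __FILE__, __LINE__); \
            std::abort(); \
        } \
    } while (0)
```

Unlike the standard assert, this is not compiled out by NDEBUG, so a failing check in a shipped build still produces a precise post-mortem location instead of a mystery crash.<br />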
With all of the above, we're rarely ever hitting a seg-fault (maybe twice a year on the server-side). If something breaks, then it is very likely an assertion check which got hit, and this is usually very easy to post-mortem-debug since it comes with a call-stack and method signature.Unknownnoreply@blogger.comtag:blogger.com,1999:blog-2948438400037317662.post-15359280991895265062013-05-04T16:58:00.000+01:002013-05-04T17:01:25.375+01:00Minor demos and web page updateCouple of minor changes at <a href="http://www.flohofwoe.net/">http://www.flohofwoe.net</a>:<br />
<ul>
<li>I have removed the non-asm.js demos. Since the asm.js code generation in emscripten is now always faster than the "traditional" code generation, it doesn't make sense to have the non-asm.js code around. I'll keep support for the old code generation in my build pipeline for now, to be able to run comparisons between the new and old code from time to time though.</li>
<li>The demos are now compiled with link-time optimization enabled. Previously this caused subtle and hard-to-debug code generation problems, but it looks like this is fixed now (fingers crossed). Performance and code size don't seem to differ much, however.</li>
<li>Demos have been recompiled with the latest emscripten incoming branch.</li>
<li>I added experimental support for uncompressed textures if the WebGL implementation doesn't support DXT textures (e.g. mobile platforms). This will decompress textures on the fly after download. For now this is just a workaround/hack and hasn't been tested that much. Also, since uncompressed textures are 4..8x bigger, this isn't really useful for complex games.</li>
<li>I have added a high-level source code page for people who like to read some code: <a href="http://www.flohofwoe.net/sources.html">http://www.flohofwoe.net/sources.html</a></li>
<li>Finally, <a href="http://n3emscripten.appspot.com/">http://n3emscripten.appspot.com</a> will no longer be updated, and I've put a link to the new demos there.</li>
</ul>
<div>
-Floh.</div>
Unknownnoreply@blogger.comtag:blogger.com,1999:blog-2948438400037317662.post-85886778341395218692013-04-25T18:05:00.001+01:002013-04-25T18:05:23.212+01:00Quo Vadis Talk, New Demo PlaceQuick update:<br />
<br />
Just came back from Quo Vadis 2013 in Berlin where I talked about "C++ on the Web" in front of a crowded room (thanks to all who've been there :), the slides are here:<br />
<br />
<a href="http://de.slideshare.net/andreweissflog3/quovadis2013-cpp-ontheweb">http://de.slideshare.net/andreweissflog3/quovadis2013-cpp-ontheweb</a><br />
<br />
And I have moved the Nebula3/emscripten demos to my own web site here:<br />
<br />
<a href="http://www.flohofwoe.net/demos.html">http://www.flohofwoe.net/demos.html</a><br />
<br />
The demos at the old appspot.com URL haven't been updated in a while. When I get around to it I'll redirect to the new demo page from there.<br />
<br />
Over and out :)<br />
-Floh.Unknownnoreply@blogger.comtag:blogger.com,1999:blog-2948438400037317662.post-73029686165396125582013-03-22T16:04:00.002+01:002013-03-22T16:07:33.042+01:00Why I spend my precious spare time with emscriptenI recently realized that I have spent much more time with emscripten than with any other "weekend project" so far. At the least, the emscripten-based demos became the most advanced of all my spare-time coding platforms of the past 2 years, like iOS, Android, Google Native Client and flascc.<br />
<div>
<br /></div>
<div>
I think it comes down to "open, free and painless", for spare-time projects these are all extremely important points. I want to spend my free time with stuff that is fun.</div>
<div>
<br /></div>
<div>
Let's look why the other stuff isn't as much fun:</div>
<div>
<br /></div>
<div>
<b>iOS:</b> The tools you need for development are all free, Xcode is a very slick IDE to work in, and unlike VisualStudio there's no artificial distinction between a (feature-cut) free and a (pricey) professional version. So far so good. The pain starts when you want to run your code on your actual iOS device. Welcome to provisioning profile hell. First you need to hand over $99 per year for the privilege to run your own code on your own hardware, but that's the least of it. Next you need to create "provisioning profiles" on Apple's developer portal, registering each team member, device and application and set up who may do what. In the end you essentially get per-app/per-device code-signing-certificates which expire every three months. So all the iOS demos which I did 2 years ago don't work anymore unless I go through all that hell again. Nope.</div>
<div>
<br /></div>
<div>
<b>Android:</b> Android C++ development sucks, plain and simple. It's a pain in the ass to set up (it's less painful if you use nVidia's ready-made installer), remote debugging a native app is so slow it's essentially useless, and you can't use the cool new stuff since most of the world is still running an Android version from the stone age. To be fair, this was all 1.5 years ago, but I have little motivation to waste further weekends on finding out whether things have improved since then ;)</div>
<div>
<br /></div>
<div>
<b>Google Native Client:</b> The main reasons why I stopped dabbling with Native Client are that it is still not opened up (it only works with Chrome Web Store bundled apps), and that pNaCl seems to take forever to be finished. To be fair, Native Client has very good middleware support (like FMOD or RakNet), but it doesn't look like it will ever be implemented outside of Chrome.</div>
<div>
<br /></div>
<div>
<b>flascc:</b> I played around with flascc for a weekend or two; 2 main reasons why it didn't set my heart on fire: (1) Compiling/linking is extraordinarily slow AND/OR uses infinite amounts of RAM. For reasonably big code bases (like Nebula3) it's unusable because my 4GB Mac simply ran out of memory. (2) Since working with flascc is so damn slow I wasn't motivated to actually go on with writing a Stage3D wrapper for N3's rendering layer.</div>
<div>
<br /></div>
<div>
So all in all, emscripten is the most frictionless way for me to write and actually publish 3D demos. I can host the demos wherever I want, update them without a certification or signing process getting in the way, the demos won't expire, they are automatically multi-platform and finally, there's no vendor or platform lock-in. Most of the code I'm writing is platform-agnostic C++ and will compile and run anywhere, and the host platform's "API footprint" is minimal: a subset of POSIX and OpenGL, which will also compile almost anywhere else with minimal changes.</div>
<div>
<br /></div>
Unknownnoreply@blogger.comtag:blogger.com,1999:blog-2948438400037317662.post-5046228749917238272013-03-18T23:15:00.000+01:002013-03-20T22:54:57.558+01:00Updated Nebula3/emscripten Demos<b>Update 3:</b> I replaced SQLite with a TableData addon, this reduces the map-viewer-demo size from 8 MB down to 5 MB (uncompressed), and reduces startup time dramatically.<br />
<b><br /></b>
<b>Update 2: </b>Demos should now properly work on all WebGL configs again (all configs which support DXT textures, to be exact). I had been using more than 254 vertex shader uniforms, and at least ANGLE restricts this number (even if the GPU could actually handle a lot more).<br />
<b><br /></b>
<b>Update: </b>Demos don't work on Windows and some other configs since one of the new GLSL shaders doesn't compile. Tested configs are: OSX 10.7.5 with GeForce 9400M, Intel HD3000, HD4000 and Radeon HD 6770M. Fix is coming later today.<br />
<br />
Finally a new demo update! If you're a Chrome user, please be aware that you need to run these demos in the very latest Chrome Canary (Version 27.0.1444.3 canary) since this contains a bugfix in the V8 Javascript engine (details are here: <a href="https://code.google.com/p/chromium/issues/detail?id=177883" target="_blank">https://code.google.com/p/chromium/issues/detail?id=177883</a>). This bug was also the reason why I held back updates for so long, I couldn't overwrite the version which reproduces this bug, but I also didn't feel like setting up yet another AppEngine project.<br />
<div>
<br /></div>
<div>
Updated demos are here: <a href="http://n3emscripten.appspot.com/" target="_blank">http://n3emscripten.appspot.com</a></div>
<div>
<br /></div>
<div>
The DSO map viewer demo is now much closer to the actual map renderer of the Drakensang Online client:<br />
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://n3emscripten.appspot.com/dsomapviewer.html" target="_blank"><img border="0" height="325" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiArqUimaiCWqhqNlu6Q_J8Kh0WBS5yJReqQAdKVjH3gMwf2RXVuLS-d9HYzc8VTB3SnYlZHDAHzp21VsDugyuN-zR4kN_rpgYHnuVSQyYHGTBwivXV16wSbH7Ejp0Xs6Lyy_Da0digqg4D/s640/Screen+Shot+2013-03-18+at+10.54.35+PM.png" width="640" /></a></div>
<div>
<br /></div>
</div>
<div>
The ground-decals system has been moved over, which helps a lot in hiding the tiling structure of the level. The rendering pipeline now includes posteffects like bloom and color-balancing. You're now controlling a "player character", and I added a few more "NPCs" to the map in order to check performance with a couple of characters on screen.</div>
<div>
<br /></div>
<div>
All demos now come in 2 flavours: "regular" and "asm.js". </div>
<div>
<br /></div>
<div>
ASM.JS is a Mozilla project to define a small subset of Javascript which can be exceptionally well optimized. More about that here: <a href="http://asmjs.org/">http://asmjs.org/</a></div>
<div>
<br /></div>
<div>
I also identified the cause of the long pause at the start of the map viewer demo. Originally I thought it was caused by generating the collision mesh, which is built at startup from tens-of-thousands of very small mesh fragments, but surprisingly this is extremely fast. The pause is actually caused by parsing the structure of an SQLite database file and reading many small items from the database. Replacing this with a more efficient "table data" subsystem is the next thing on my weekend todo list. The SQLite stuff is really a left-over from the single-player Drakensangs where the world-state was loaded from and written back to SQLite database files.</div>
<div>
<br /></div>
<div>
That's it for today!</div>
<div>
<br /></div>
Unknownnoreply@blogger.comtag:blogger.com,1999:blog-2948438400037317662.post-20272071208470717352013-02-10T19:24:00.004+01:002013-02-10T19:24:55.651+01:00Diminishing ReturnsWeekend was kinda semi-successful as far as coding is concerned. I tried various ways to reduce GL calls further, and was able to reduce the number of GL calls by about 25%: from about 4100 down to about 3000 in the initial screen of the Drakensang Online map viewer demo. Although this sounds pretty good, I'm a bit disappointed because I was hoping that bundling vertex data chunks into big vertex buffers would have a bigger effect:<br />
<br />
- Bundling vertex data into big vertex buffers cut the number of glVertexAttribPointer() calls by almost half, from about 950 down to about 500. With the GL_vertex_array_object extension however, I could save twice as many GL calls for "free" (so the demo would be down to 3100 GL calls without any additional optimizations), and the savings would be more consistent (right now it depends a lot on the order of draw calls). The bundling added *a lot* of complex code, so it's probably not really worth it; since at least Chrome already supports OES_vertex_array_object in WebGL, it would make more sense to support that extension instead.<br />
<br />
- All the rest was gained by simply filtering redundant texture updates (glActiveTexture, glBindTexture, glUniform1i). This was a big win for very little code, but this also varies with the actual textures applied to the objects. Fewer shared textures means more updates.<br />
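This kind of redundant-state filtering is just a small cache sitting in front of the GL calls. Here's a minimal sketch of the idea (not the actual Nebula3 code; the real glActiveTexture/glBindTexture calls are replaced by a counter so the saving is easy to see):

```cpp
#include <array>
#include <cstdint>
#include <cassert>

// Hypothetical redundant-texture-bind filter: remember the last texture
// bound to each unit and skip the GL calls when nothing changed.
struct TextureBindFilter {
    static const int MaxUnits = 16;
    std::array<uint32_t, MaxUnits> boundTexture{};  // 0 = nothing bound yet
    int glCallCount = 0;    // stands in for the real GL calls

    void Bind(int unit, uint32_t texture) {
        if (boundTexture[unit] == texture) {
            return;         // redundant state change filtered out
        }
        boundTexture[unit] = texture;
        glCallCount += 2;   // would be glActiveTexture() + glBindTexture()
    }
};
```

How much this saves depends entirely on the scene: fewer shared textures mean fewer filtered calls, just as described above.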
<br />
I also tried to generally filter redundant shader uniform updates, but with little effect. Apart from the texture updates, an entire frame had fewer than 10 redundant uniform updates, so not worth it.<br />
<br />
I'll give the GL call optimization a little rest for now and concentrate on adding features. There's still some untapped potential in grouping transform matrix updates into arrays, and by better sorting inside batches. But right now I've had enough ;)<br />
<br />Unknownnoreply@blogger.comtag:blogger.com,1999:blog-2948438400037317662.post-3293835870253250392013-01-23T21:08:00.000+01:002013-01-23T21:28:48.389+01:00A Radeon Fix and MoreThe Nebula3/emscripten demos (<a href="http://n3emscripten.appspot.com/" target="_blank">http://n3emscripten.appspot.com</a>) had a serious performance problem on Macs with Radeon GPUs in the instancing demos. Problem was that my pseudo-instancing code used an additional vertex-buffer with 1-dimensional ubyte vertex components as fake InstanceIds. This worked fine on nVidia and Intel GPU, but triggered a horrible slow-path in the OSX Radeon driver. After replacing this with ubyte4 components everything worked fine on Radeons, but I wasn't happy that the InstanceId buffer would now be 4 times as large, with 3/4 of the the size dead weight. Then today in the train from Hamburg back to Berlin the embarrassingly obvious solution occured to me to stash the InstanceId in the unused w-component of the vertex normals. These are in packed ubyte4 format, with the last byte unused. And with this simple fix I could get rid of the second vertex buffer completely and actually throw away most of the pseudo-instancing code. Win-Win!<br />
<br />
And now on to the actual issue: I didn't really pay attention to the code path which is used if the GL vertex array object extension isn't available, and I was shocked when I discovered that the dsomapviewer demo performs 7000 GL calls per frame (not draw-calls, but all types of GL calls), and then I was astonished that Javascript+WebGL crunches through those 7k calls without a problem even on my puny laptop. But something had to be done about that of course.<br />
<br />
OpenGL / WebGL without extensions is very verbose even compared to Direct3D9. To prepare the geometry for rendering, you need to bind a vertex buffer (or several), bind an index buffer, and for each vertex component call glEnableVertexAttribArray() and glVertexAttribPointer(), aaaand each unused vertex attribute must be disabled with glDisableVertexAttribArray(). Depending on the max number of vertex attributes supported in the engine, this can add up to dozens of calls just to switch geometry. And whenever a different vertex buffer is bound, at least the glVertexAttribPointer() functions must be called again, and if the vertex specification has changed, vertex attribute arrays must be enabled or disabled accordingly.<br />
<br />
With the vertex array object extension all of this can be combined into a single call.<br />
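The difference can be made concrete with a back-of-the-envelope call count (the attribute numbers are illustrative assumptions, not engine constants):

```cpp
#include <cassert>

// Rough count of GL calls needed to switch geometry in plain WebGL,
// versus a single glBindVertexArrayOES() with the VAO extension.
inline int CallsWithoutVAO(int activeAttrs, int maxAttrs) {
    int calls = 2;                      // glBindBuffer for vertex + index buffer
    calls += activeAttrs * 2;           // glEnableVertexAttribArray + glVertexAttribPointer each
    calls += (maxAttrs - activeAttrs);  // glDisableVertexAttribArray for the unused attributes
    return calls;
}

inline int CallsWithVAO() {
    return 1;                           // one glBindVertexArrayOES()
}
```

With, say, 4 active attributes out of 16 supported, that's 22 calls per geometry switch collapsed into one.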
<br />
This particular part of defining the vertex layout is by far the least elegant area of the OpenGL spec, and even the vertex array object stuff could be nicer. To me it doesn't make a lot of sense to include the buffer binding in the vertex attribute state, keeping the buffer separate from the vertex layout would make more sense IMHO. But enough with the ranting.<br />
<br />
Other high-frequency calls are the glUniformXXX() functions to update shader variables, and the whole process of assigning textures to shaders. Un-extended WebGL doesn't provide functions to bundle these static shader updates into some sort of buffers.<br />
<br />
These types of high-frequency calls are exactly what we don't want in Javascript and WebGL. In a native OpenGL app, these calls are usually extremely cheap, so it doesn't matter that much. But when calling a WebGL function from emscripten, there's quite a lot of overhead (at least compared to a native GL app). First, emscripten maintains some lookup tables to associate numeric GL ids with Javascript objects. Then the WebGL JS functions are called; in Chrome, these calls are serialized into a command buffer which is transferred to another process, and in this GPU process the commands are unpacked, validated, and the actual GL function is called. But it doesn't end there. On Windows, the ANGLE wrapper translates the OpenGL calls to Direct3D9 calls. So what's an extremely cheap GL call in a native app comes with some serious overhead in a WebGL app. Considering all this it is really mind-blowing that WebGL is still so fast!<br />
<br />
All this means though, that it really makes a lot of sense to filter redundant GL calls, especially in a WebGL application, and every GL extension which helps to reduce the number of API calls is many times more valuable under WebGL!<br />
<br />
So my mission in the train from Berlin to Hamburg and back today was to filter out those redundant GL calls.<br />
<br />
First I wanted to know what calls are actually the problem. The OSX OpenGL Profiler tool can help with this. It records a trace of all OpenGL calls, can create a quick stat of the most-called functions, and the sequence of calls with their arguments reveals which calls suffer most from redundancy.<br />
<br />
In the dsomapviewer demo these are: glEnableVertexAttribArray(), glDisableVertexAttribArray(), glBindBuffer() and glUseProgram().<br />
<br />
Apart from filtering those lowlevel calls I also implemented a separate high-level filter which skips complete mesh assignment operations (that whole call sequence of buffer bindings and vertex attribute specification I talked about before).<br />
<br />
All in all the results were encouraging: per-frame GL calls dropped from 7k down to 4k. In comparison: when using the vertex array object extension the number of GL calls goes down to about 3k.<br />
<br />
This could be improved even more by reducing the number of vertex buffers, and bundling the vertex data of many graphics objects into one or few big vertex buffers, since then much fewer buffer binds and vertex attribute specification calls would be needed (at least if they occur in the right sequence). But for this I would either need the glDrawElementsBaseVertex() function, which is not available in WebGL, or I would need to fix-up lots of indices whenever vertex data is created or destroyed (but this would limit the size of one compound vertex buffer to 64k vertices, and limit the efficiency of the bundling, hmm...).<br />
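The index fix-up variant would look roughly like this (an illustrative sketch, not engine code): without glDrawElementsBaseVertex(), each mesh's 16-bit indices must be rebased when its vertices are appended to a big shared vertex buffer, and the rebased indices must still fit into 16 bits, which is exactly what limits a compound buffer to 64k vertices.

```cpp
#include <vector>
#include <cstdint>
#include <cassert>

// Rebase a mesh's indices by the position of its vertices inside a big
// shared vertex buffer. Returns false if an index would overflow the
// 16-bit range (a real implementation would check this before mutating).
inline bool RebaseIndices(std::vector<uint16_t>& indices, uint32_t baseVertex) {
    for (uint16_t& i : indices) {
        uint32_t rebased = uint32_t(i) + baseVertex;
        if (rebased > 0xFFFF) {
            return false;   // exceeds the 16-bit index range
        }
        i = uint16_t(rebased);
    }
    return true;
}
```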
<br />
Anyway, to wrap this up, Chrome already exposes the OES_vertex_array_object extension, and an ANGLE_instanced_arrays extension seems to be on the way. Both should help a lot to reduce GL calls already. Then the only remaining problem is texture assignment and uniform updates in scenes with many different materials.<br />
<br />
But I think before working on reducing GL calls even more I'll try to do something about the stuttering when new graphics assets are streamed in.<br />
<br />
Over & Out,<br />
-Floh.<br />
<br />Unknownnoreply@blogger.comtag:blogger.com,1999:blog-2948438400037317662.post-24803027589649589572013-01-19T18:36:00.001+01:002013-01-23T21:16:10.822+01:00A Drakensang Online map viewer in emscripten<b>Update 2: </b>The OSX/Radeon performance problem should be fixed now. See here: <a href="http://flohofwoe.blogspot.de/2013/01/a-radeon-fix-and-more.html" target="_blank">http://flohofwoe.blogspot.de/2013/01/a-radeon-fix-and-more.html</a><br />
<b><br /></b>
<b>Update: </b>Just found out that the demo runs incredibly slowly on a 15" MacBook Pro when running on the discrete AMD Radeon HD 6770M chip (it's actually much faster on the integrated Intel HD 3000). This happens in both Chrome and Firefox, reason unknown yet. So if you have one of these, note that the demo normally runs a lot smoother ;)<br />
<br />
I did a very simple proof-of-concept Drakensang Online map viewer in Nebula3/emscripten (as always, Chrome or Firefox required), to see how JS+WebGL can deal with a close-to-real-world 3D scenario:<br />
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><span style="margin-left: auto; margin-right: auto;"><a href="http://n3emscripten.appspot.com/dsomapviewer.html" target="_blank"><img border="0" height="324" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgGX7Vuh5muqz4ciyFTeSLpBIPgQb43TvtFV7vdI_qjTcd0_gqVk180ljllGJrXdCiuya449xk_r9gdBJkqK-u0JyjLf3j73_fOD45Pdw9fkUMI_rOK6ipQtde2r99sPNWL1f0sRgFNeiYx/s640/dsomapviewer.png" width="640" /></a></span></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><a href="http://n3emscripten.appspot.com/dsomapviewer.html" target="_blank">Drakensang Online map viewer</a></td></tr>
</tbody></table>
This is work in progress and I will spend more time with optimizations before moving on to the next demo.<br />
<div>
<br /></div>
<div>
You'll notice that there's still frame-rate-stuttering when moving around the map (with left-mouse-button + dragging). The bad type of stuttering is caused by asset loading which happens on demand when new graphics objects are pulled in as they enter the view volume. I don't know yet what causes the lighter stuttering when moving around in areas which are completely loaded. I need to do a detailed profiling session to figure out what's going on there exactly. The stuttering also happens (to a lesser extent) in the native OSX version of the demo. It's most likely the preparation and creation of OpenGL resources, like vertex buffers, index buffers and textures. I will need to figure out how to move more of the asset creation stuff out of the main thread.<br />
<div>
<br /></div>
<div>
The demo is also quite demanding on WebGL. Despite the pseudo-instancing which I implemented recently there's still a lot of OpenGL calls per frame. Support for the <b>OES_vertex_array_object</b> (Chrome already exposes this) and something like <b>ARB_instanced_arrays</b> would help a lot to reduce the number of GL calls drastically (the JS profiler currently shows the vertex array definition as the most expensive rendering-related code, followed by the matrix array uniform updates for the pseudo instancing code).</div>
<div>
<br /></div>
<div>
Finally I've added a new Nebula3 code module to this demo: the ODE-based physics and collision subsystem is now also running in emscripten (no changes were necessary); the demo sets up a static collide world at startup and uses this to perform stabbing checks under the mouse pointer. Unfortunately adding ODE almost doubled the size of the generated Javascript code. This is another incentive to finally get rid of our (somewhat bloated) physics wrapper code and ODE, and build a new slim collision system, probably on top of the Bullet collision classes (we're mainly using the current physics wrapper for simple collision checks on a static collide world in the live version of Drakensang Online, so not much of value will be lost).</div>
<div>
<br /></div>
<div>
Also, originally I wanted to include SQLite into the demo, since extra map info is currently stored in a separate SQLite file (lighting information, player start position, etc...). But this didn't work out of the box because SQLite's file i/o code must be adapted.</div>
<div>
<br /></div>
<div>
This wouldn't be hard to fix, but I've actually wanted to get rid of SQLite for a long time. SQLite was really useful as a save-game system in the single-player Drakensang games, but if you don't need to save game world changes back, a complete SQL implementation in the client is just overkill. So this is another good reason to finally get started with a nice and small TableData-subsystem in Nebula3.</div>
<div>
<br /></div>
<div>
The frame-stuttering is a tiny bit disheartening, but on the other hand this is to be expected when bringing a complex code base over to a new platform. Most important right now is to really know what's going on, so I will probably spend some time adding profiling code and do some performance analysis next - together with text rendering to get some continuous debug statistics output on screen.</div>
<div>
<br />
Exciting stuff :D</div>
<div>
<br /></div>
</div>
Unknownnoreply@blogger.comtag:blogger.com,1999:blog-2948438400037317662.post-76112694316365854202013-01-13T15:31:00.000+01:002013-01-13T15:34:17.809+01:00Multithreading in emscripten with HTML5 WebWorkersMultithreading in emscripten is different from what we C/C++ coders are used to. There is no concept of threads with shared memory state in Javascript, so emscripten can't simply offer a pthreads wrapper like NaCl does. Instead it uses HTML5 WebWorkers and a highlevel message-passing API to spread work across several CPU cores.<br />
<br />
You basically pass a memory buffer over to the worker thread as input data, the worker thread does its processing and passes a memory buffer with the result data back to the main thread.<br />
<br />
The downsides are <b>(1)</b> you can't simply port your existing multi-threaded code over to emscripten, <b>(2)</b> it is (currently) somewhat expensive to pass data around since it involves copying, and <b>(3)</b> you cannot express all multithreading patterns in emscripten. The upside, though, is that it's really hard to shoot yourself in the foot, since there's no shared state, and all the multithreading primitives you love to hate (like mutexes, semaphores, cond-vars, atomic-ops) simply don't exist.<br />
<br />
Let's have a quick look at emscripten's worker API, only 4 API-functions and 2 user-provided functions are necessary:<br />
<br />
<b>worker_handle emscripten_create_worker(const char* url);</b><br />
<br />
This creates a new worker object; it takes the URL of a separate emscripten-generated Javascript file.<br />
<br />
The worker file must export at least one C-function (the name doesn't matter, but the function name must be explicitly exported using emscripten's new "-s EXPORTED_FUNCTIONS" switch so that it isn't removed by dead-code elimination). The worker function prototype looks like this:<br />
<br />
<b>void dowork(char* data, int size);</b><br />
<br />
The arguments define the location and size of the input data.<br />
<br />
The function to invoke the worker is:<br />
<br />
<b>void emscripten_call_worker(worker_handle worker, const char *funcname, char *data, int size, void (*callback)(char *, int, void*), void *arg);</b><br />
<br />
This takes the worker handle returned by emscripten_create_worker(), the name of the worker function (in our case "dowork"), a pointer to and size of the input data, a completion callback function pointer, and finally a custom argument which is passed through to the completion callback to associate the completion call with the invocation call.<br />
<br />
At some point after emscripten_call_worker() is called, the dowork-function will be called in the worker thread with a data pointer and size. Since the worker has its own address space, the actual pointer value will be different from the pointer value in the emscripten_call_worker call of course.<br />
<br />
The worker function now uses this input data to compute a result, and (optionally) hands this result back to the main thread using this function:<br />
<br />
<b>void emscripten_worker_respond(char* data, int size);</b><br />
<b><br /></b>
The return-data will be copied inside the function, so if the worker function has allocated a result buffer it remains the owner of that buffer and is responsible for releasing it.<br />
<br />
Finally, once the worker has finished, the completion callback will be called on the main thread with the result data, and the custom arg given in the emscripten_call_worker() call:<br />
<br />
<b>void completion_callback(char* data, int size, void* arg);</b><br />
<br />
The callback does not gain ownership of the data buffer; it may read or copy the received data, but must not write to or free the buffer.<br />
<br />
Finally there's a function to destroy a worker:<br />
<br />
<b>void emscripten_destroy_worker(worker_handle worker);</b><br />
<br />
As with threads, creating and destroying workers is not cheap, so you should create a couple of workers at the start of the application and keep them around, instead of creating and destroying workers repeatedly. It's also wise to batch as much work as possible per worker invocation to offset the call-overhead as much as possible (don't call a worker many times per frame, ideally only once), but this is all pretty much common sense.<br />
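The copy/ownership rules described above are easiest to see in code. The following is a single-threaded mock of the round trip (NOT the real emscripten API, which runs the worker function in a separate WebWorker process): the worker builds a result in its own storage and "responds", the response is copied before the completion side sees it, so the worker keeps ownership of its own result buffer.

```cpp
#include <vector>

// Mock response slot; stands in for the buffer that the real
// emscripten_worker_respond() copies the result data into.
static std::vector<char>* g_responseSlot = nullptr;

// stands in for emscripten_worker_respond(): copies the result out
static void worker_respond_mock(const char* data, int size) {
    g_responseSlot->assign(data, data + size);
}

// the user-provided worker function (here it just doubles every byte)
static void dowork(const char* data, int size) {
    std::vector<char> result(data, data + size);
    for (char& c : result) c = char(c * 2);
    worker_respond_mock(result.data(), (int)result.size());
}   // 'result' dies here: fine, because respond already copied it

// stands in for emscripten_call_worker() plus the completion callback:
// returns what the callback would receive as its (read-only) data buffer
std::vector<char> RoundTrip(const std::vector<char>& input) {
    std::vector<char> response;
    g_responseSlot = &response;
    dowork(input.data(), (int)input.size());
    return response;
}
```

In the real API the same flow is split across two address spaces, which is exactly why everything is passed by copy rather than by pointer sharing.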
<br />
The worker Javascript file must be created as a separate compilation unit; it's a bit like on the PS3 where the SPU code also must be compiled into small, complete "SPU executables". To keep the code size small I decided to keep the runtime environment in the worker scripts as slim as possible: there's no complete Nebula3 environment, only a minimal C runtime environment. But this is not a limitation of emscripten, only a decision on my part. Most of the time the workers will contain simple math code which loops over arrays of data instead of high-level object-oriented code. To avoid downloading redundant code it might also make sense to put several worker functions into a single JS file.<br />
<br />
The updated Nebula3/emscripten demos at <a href="http://n3emscripten.appspot.com/" target="_blank">http://n3emscripten.appspot.com</a> now decompress the downloaded asset files in up to 4 WebWorker threads in parallel to the main thread; this speeds up asset loading tremendously and avoids the excessive frame hiccups which happened before. This is important, since real-world Nebula3 apps stream asset data on demand while the render loop is running. The whole thing took me about half a day, but unfortunately I stumbled across a Chrome bug which required a small workaround (see here: <a href="http://code.google.com/p/chromium/issues/detail?id=169705" target="_blank">http://code.google.com/p/chromium/issues/detail?id=169705</a>).<br />
<br />
It's not completely perfect yet. There's data copying happening on the main thread, and there's also some expensive stuff going on when creating the WebGL resources (for instance vertex and index data is unrolled for the instanced rendering hack). The ultimate goal is to move as much resource creation work off the main thread in order to guarantee smooth rendering while resources are created.<br />
<br />
There are also browser improvements in sight which will make WebWorkers more efficient in the future, mainly to avoid extra data copies by transferring ownership of the passed data over to the web worker, basically a move instead of a copy.<br />
<br />
And that's it for today :)<br />
<br />
<br />
<br />Unknownnoreply@blogger.comtag:blogger.com,1999:blog-2948438400037317662.post-85599061295012007412013-01-04T19:24:00.003+01:002013-01-04T23:08:19.415+01:00Happy New Year 2013!I've been playing around a bit more with the Nebula3/emscripten port over the holidays. Emscripten had some nice improvements during the past 2 months, mainly to generate smaller and faster code, and to drastically reduce code generation time in the linker stage (read this up on <a href="http://mozakai.blogspot.de/" target="_blank">azakai's blog</a>).<br />
<br />
The work I did on my experimental Nebula3 branch was only partially emscripten-related: The biggest chunk of work went into refactoring to adapt the higher level parts of the rendering pipeline for the new CoreGraphics2 subsystem (lighting, view volume culling, and the highlevel graphics subsystem which is concerned with Stages, Views and GraphicsEntities). A lot of code was thrown away or moved around, but from the outside everything looks quite similar to before. External code which depends on the Graphics subsystem must be fixed-up, but not rewritten.<br />
<br />
Another big chunk of work went into implementing instanced rendering for the new CoreGraphics2 system. OpenGL offers several extensions for instanced rendering, but since none of the current WebGL implementations support any of these extensions I first wrote a fallback solution which works without extensions, but uses bigger "unrolled" vertex- and index-data, and an instance-matrix palette in the vertex shader. With the current implementation, up to 64 instances can be collapsed into a single drawcall. This depends on the number of available vertex shader uniforms, and since the <a href="http://code.google.com/p/angleproject/" target="_blank">ANGLE</a> wrapper used by Chrome and Firefox on Windows generally restricts the number of vertex shader uniforms to 254, I had to go with only 64 instances per drawcall. This restricts the usage scenarios of this approach, but when rendering a Drakensang Online map (for instance), this comes pretty close to the average number of instances of environment objects in the view volume. For particle rendering this approach would be useless though.<br />
<br />
I also rewrote the emscripten filesystem wrapper. The original implementation was only a quick hack to get data loaded into the engine at all. I wrapped this now into a proper subsystem which uses new emscripten API calls to directly download data into a memory buffer without mirroring the data into a "virtual filesystem", and the new implementation also accepts the file compression of Drakensang Online's HTTP filesystem (it's not the complete HTTP filesystem implementation yet though, the table-of-content-files are ignored, as well as the per-file MD5 hashes, and there's no local file cache apart from the normal browser cache). Also, while the emscripten filesystem wrapper is asynchronous, it is not yet multithreaded through the new WebWorker API. Decompression currently happens on the main thread and may lead to frame stuttering, but the plan is to move this into separate worker threads.<br />
<br />
Finally I've uploaded a few new demos to <a href="http://n3emscripten.appspot.com/" target="_blank">http://n3emscripten.appspot.com</a>. As always you should use an up-to-date Chrome or Firefox browser to try them out.<br />
<br />
First, here's the old Dragons demo, recompiled with the latest emscripten version. Thanks to the improvements in emscripten, and the house-cleaning to remove old code, the (compressed) download size of the Javascript-code is now only 308kByte:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://n3emscripten.appspot.com/dragons.html" style="margin-left: auto; margin-right: auto;" target="_blank"><img border="0" height="363" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigP7wZ2IObnBRldatC7-DF4Pfya2QKnp0h24O02fgrP2GW5sSMhXxS7b4Zc4_K3fru818lqH6e-7DvfVdVwX1EPfgrlp5Pi9kh-OagtFjYAo0B-TeKrDufYD6l2IQDLhiTGAwV2B4J9saO/s640/dragons.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><a href="http://n3emscripten.appspot.com/dragons.html" target="_blank">Dragons Demo (Cursor up to add more dragons)</a></td></tr>
</tbody></table>
<br />
Next is a demo for the new instanced rendering. On startup, 1000 independently animated cubes are rendered, and by pressing cursor-up you can add 1000 more. There's also 128 point lights in the scene. Every 1000 cubes require about 32 draw-calls (that's (1000/64)*2: the instancing collapses 64 cubes into one draw call, and then *2 because of the NormalDepth- and Material-Passes of the Light Pre-Pass Renderer). For every cube, a world-space transform matrix is computed per frame on the CPU (a conversion from polar-coordinates to cartesian coordinates, involving two sin() and two cos(), and a matrix-lookat involving several normalizations and cross-products).<br />
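The draw-call arithmetic above can be captured in a tiny helper (just the estimate from the text as code, illustrative rather than engine code):

```cpp
#include <cassert>

// Estimate the per-frame draw calls for the pseudo-instancing demo:
// up to 64 instances collapse into one draw call, and everything is
// drawn twice (NormalDepth + Material pass of the Light Pre-Pass renderer).
inline int DrawCallsPerFrame(int numInstances, int instancesPerCall = 64, int passes = 2) {
    int batches = (numInstances + instancesPerCall - 1) / instancesPerCall;  // round up
    return batches * passes;
}
```

For the 1000-cube startup scene this gives the "about 32 draw-calls" quoted above.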
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><span style="margin-left: auto; margin-right: auto;"><a href="http://n3emscripten.appspot.com/instancing.html" target="_blank"><img border="0" height="362" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_fESZboScfZKpe4E5h2QbmZq6ZnFguO5DjEfq_w65bYKAFWV48re4XGzJ6LAFYJHbFDjSrhPP4EdPF2fsEajfolvUMbuc6taTIcE3nOi5Drd3AdttaG5CEV8PlJdSL4Qbfx7vD1w4JMnu/s640/instancing.png" width="640" /></a></span></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><a href="http://n3emscripten.appspot.com/instancing.html" target="_blank">Pseudo Instancing</a></td></tr>
</tbody></table>
<br />
By hitting the space key you can also enable a disco-light posteffect for giggles; this adds an additional single-pass fullscreen effect which does a lot of texture sampling:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://n3emscripten.appspot.com/instancing.html" style="margin-left: auto; margin-right: auto;" target="_blank"><img border="0" height="362" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTS_hIHDKFmPTG8bDwoYexvyjOaPKGtsxze_tXYLCdXBH-CUkITRqd1vr1VOsYUeKyA6qRYWS0Le7HNoc8VVyZEIwHq9crLAZOtYBOF-0H8nDTPkeh-1cbYRU0uL-zLb4AYZmnn87kMTBN/s640/instancing_posteffect.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><a href="http://n3emscripten.appspot.com/instancing.html" target="_blank">Pseudo Instancing with Disco posteffect (press Space)</a></td></tr>
</tbody></table>
<br />
And finally I wrote a little Drakensang Online monster viewer. With cursor-up/down you can switch to the next/previous monster, with cursor-right you can flip between different skin lists (appearances), and with cursor-left you can toggle a few animations (usually idle and running anims). Obviously the material shader is different from Drakensang Online: the color texture is replaced with plain white, and the specular effect is exaggerated (which is actually a nice showcase for the really good normal maps of our character models). This is only a snapshot of what's currently in the game; especially most of the animations are not included. The strange cubes which are displayed sometimes are the mesh-placeholder objects. I think I'll remove them and just use no placeholder as long as the mesh is not loaded; at least they show that the placeholder system is working right ;)<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><span style="margin-left: auto; margin-right: auto;"><a href="http://n3emscripten.appspot.com/dsocharviewer.html" target="_blank"><img border="0" height="364" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhtm9LexzJIF3QdaXL3ca4aqHIQF5qCk8gfFUpCBSWRTX_rp8YxCCdbFcO8jbgCkWqNh4yAARkTmMLBPCrw6PxX68rmYL081odkmLUwkcCHEAETVRJyrbDA0P85jWnTqLVKp-bKa8SvZe7q/s640/monster.png" width="640" /></a></span></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><a href="http://n3emscripten.appspot.com/dsocharviewer.html" target="_blank">Drakensang Online Monster Viewer</a></td></tr>
</tbody></table>
That's it for today :)<br />
<br />
<br />Unknownnoreply@blogger.comtag:blogger.com,1999:blog-2948438400037317662.post-37301664929757538992012-12-16T18:06:00.000+01:002012-12-16T18:52:30.977+01:00CoreGraphics2That's Twiggy's official name now.<br />
<br />
I've basically written a vertical slice of the new Nebula3 Render Layer during the past few weekends where I'm trying out a few ideas of what the Nebula3 rendering system will look like in the future.<br />
<br />
The lowest-level subsystem is <b>CoreGraphics2</b>, which I wrote about already a little bit.<br />
<br />
It wraps the host platform's 3D API (e.g. OpenGL or Direct3D), but its rendering vocabulary is higher level / less verbose than OpenGL/D3D. It runs the render thread, but can also be compiled without threading (on the emscripten platform, for instance). There's a facade singleton object (CoreGraphics2Facade) which wraps the entire functionality into a surprisingly simple interface.<br />
<div>
<br /></div>
<div>
CoreGraphics2 works with only 5 resource types:</div>
<div>
<ol>
<li><i style="font-weight: bold;">Texture:</i> Just what the name implies, a texture resource object. This also includes render targets.</li>
<li><i style="font-weight: bold;">Mesh:</i> This encapsulates all the required geometry data for a drawing operation: vertex buffer, index buffer (optional), vertex layout / vertex array definition, and "primitive groups" (basically sub-mesh definitions). </li>
<li><i style="font-weight: bold;">DrawState:</i> This wraps all the required shader and render-state data for a drawing operation: a reference to a shader object, shader constants (one-time-init, immutable), shader variables (mutable) and an (immutable) state-block for render-states.</li>
<li><i style="font-weight: bold;">Pass:</i> A pass object holds all required data for a rendering pass, this includes a render-target-texture object, and a DrawState object which defines state which is valid for the rendering pass. All rendering must happen inside passes. Typical passes in a pre-light-pass renderer are for instance the NormalDepth-Pass, the Light-Pass, the Material-Pass, and a Compose-Pass. The pass object also contains the information whether and how the render target should be cleared at the start of the pass.</li>
<li><i style="font-weight: bold;">Batch:</i> A batch object just contains a DrawState object which defines render state for several draw operations, so this is just a way to reduce redundant state switches.</li>
</ol>
<div>
Resource objects are opaque to the outside. To the caller, these are just ResourceId objects, there's no way to directly access the data in the resource objects (since they actually live in the render thread).</div>
</div>
<div>
<br /></div>
<div>
Resource creation happens by passing a Setup object to one of the Create methods in the CoreGraphics2Facade singleton. There's one Setup class for each resource type (so basically TextureSetup, MeshSetup, DrawStateSetup, PassSetup and BatchSetup). The Setup object basically describes how the resource should be created and shared (for instance when creating a texture resource, the Setup object would contain the path to the texture file, whether the texture should be loaded asynchronously, whether the texture object should be a render target, and so on). The render thread will keep the Setup objects around, so it has all information available to re-create the resource (for instance because of D3D's lost device state, or for more advanced resource management where currently unused resources can be removed from memory, and re-loaded later).</div>
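The Setup-object pattern might look roughly like the following minimal sketch (all names here are made up for illustration, not the real Nebula3 API). The key point is that the facade keeps every Setup object around, so a resource can be rebuilt from its "recipe" after a lost-device event or after being evicted:

```cpp
#include <cstdint>
#include <map>
#include <string>

using ResourceId = std::uint32_t;

// describes how a texture resource should be created and shared
struct TextureSetup {
    std::string path;          // where to load the texture from
    bool async = true;         // load asynchronously?
    bool renderTarget = false; // is this a render target?
};

class Facade {
public:
    ResourceId CreateTexture(const TextureSetup& setup) {
        ResourceId id = nextId++;
        setups[id] = setup;    // keep the recipe around for re-creation
        // ... allocate the actual GPU-side resource here ...
        return id;
    }
    // after a lost device, every resource can be rebuilt from its Setup
    const TextureSetup& GetSetup(ResourceId id) const { return setups.at(id); }

private:
    ResourceId nextId = 1;
    std::map<ResourceId, TextureSetup> setups;
};
```

To the caller, only the opaque ResourceId ever escapes; the stored Setup stays on the render-thread side.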
<div>
<br /></div>
<div>
All rendering happens by calling methods of CoreGraphics2Facade:</div>
<div>
<br /></div>
<div>
<b>Begin / End methods:</b></div>
<div>
These methods structure a frame into segments. </div>
<div>
<ul>
<li><i>BeginFrame / EndFrame:</i> Signal the start and end of a render frame. </li>
<li><i>BeginPass / EndPass:</i> Signal start and end of a rendering pass. BeginPass takes the ResourceId of a Pass object, makes the render target of the pass active, optionally clears the render target, and applies the render state of the DrawState object of the pass.</li>
<li><i>BeginBatch / EndBatch:</i> Signal start and end of a rendering batch. This simply applies the render state of the DrawState object of the batch.</li>
<li><i>BeginInstances / EndInstances:</i> This is where it gets interesting. BeginInstances sets all the required state for a series of Draw commands. It takes a Mesh ResourceId, a DrawState ResourceId, and a "shader variation bitmask". The bitmask basically selects a "technique" from the shader (in D3DXEffects terms). For instance, to select the right shader technique for rendering the NormalDepth-pass of a skinned object, one would pass "NormalDepth|Skinning" as the bitmask.</li>
</ul>
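The "shader variation bitmask" idea can be illustrated with a tiny sketch (made-up names, not the actual Nebula3 shader system): each technique in a shader carries a feature mask, and BeginInstances picks the technique whose mask matches the requested combination, e.g. "NormalDepth|Skinning":

```cpp
#include <cstdint>
#include <string>
#include <vector>

// feature bits that can be combined into a shader variation bitmask
enum FeatureBits : std::uint32_t {
    NormalDepth = 1 << 0,
    Material    = 1 << 1,
    Skinning    = 1 << 2,
};

// a "technique" (in D3DXEffects terms) tagged with its feature mask
struct Technique {
    std::uint32_t mask;
    std::string name;
};

// return the technique that exactly matches the requested feature bitmask,
// or nullptr if no such variation was compiled into the shader
const Technique* SelectTechnique(const std::vector<Technique>& techs,
                                 std::uint32_t mask) {
    for (const Technique& t : techs) {
        if (t.mask == mask) return &t;
    }
    return nullptr;
}
```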
<div>
<b>Apply methods:</b></div>
</div>
<div>
This method group applies dynamic state changes during a frame:</div>
<div>
<ul>
<li><i>ApplyProjectionTransform, ApplyViewTransform, ApplyModelTransform: </i>Sets the projection, view and model matrices.</li>
<li><i>ApplyVariable: </i>applies a shader variable value to the currently active DrawState object (which has been set during BeginInstances). This is a template method, specialized for each shader variable data type (float, int, float4, matrix44, bool).</li>
<li><i>ApplyVariableArray:</i> same as ApplyVariable, but for an array of values.</li>
</ul>
</div>
<div>
<b>Draw methods:</b></div>
<div>
This method group performs actual drawing operations:</div>
<div>
<ul>
<li><i>Draw:</i> Performs a single draw call, must be called inside BeginInstances/EndInstances. Renders a PrimitiveGroup (aka material group) from the currently active Mesh, using the render state defined in the currently active DrawState. For non-instanced rendering one would usually perform several ApplyModelTransform() / Draw() pairs in a row.</li>
<li><i>DrawInstanced:</i> Like Draw, but takes an array of per-instance transforms to render the same mesh at many different positions. Tries to use some sort of hardware instancing, but falls back to a "tight render loop" if no hardware instancing is available.</li>
<li><i>DrawFullscreenQuad:</i> simply renders a fullscreen quad with the currently set DrawState; this is used for fullscreen post-effects.</li>
</ul>
<div>
And that's it basically. I'm quite happy with how simple everything looks from the outside, and how straight-forward the innards work. For instance, leaving the shader system aside (which is implemented in a separate subsystem CoreShader), the OpenGL specific code in CoreGraphics2 is just 7 classes, and the biggest file is around 600 lines of code.</div>
</div>
<div>
<br /></div>
<div>
And it's simple to use, for instance here's the render loop to render the point lights in the new LightPrePassRenderer (hopefully the Blogger editor won't screw up my formatting):</div>
<pre><code>CoreGraphics2Facade* cg2Facade = CoreGraphics2Facade::Instance();
if (this-&gt;pointLights.Size() &gt; 0)
{
    cg2Facade-&gt;BeginInstances(this-&gt;pointLightMesh, this-&gt;lightDrawState, this-&gt;pointLightFeatureBits, false);
    IndexT i;
    for (i = 0; i &lt; this-&gt;pointLights.Size(); i++)
    {
        const Light* curLight = this-&gt;pointLights[i];
        const matrix44&amp; lightTransform = curLight-&gt;GetTransform();

        // compute light position in view space, and set .w to inverted light range
        float4 posAndRange = matrix44::transform(lightTransform.get_position(), this-&gt;viewTransform);
        posAndRange.w() = 1.0f / lightTransform.get_zaxis().length();

        // update shader params
        cg2Facade-&gt;ApplyModelTransform(lightTransform);
        cg2Facade-&gt;ApplyVariable&lt;float4&gt;(LightPosRange, posAndRange);
        cg2Facade-&gt;ApplyVariable&lt;float4&gt;(LightColor, curLight-&gt;GetColor());
        cg2Facade-&gt;ApplyVariable&lt;float&gt;(LightSpecularIntensity, curLight-&gt;GetSpecularIntensity());
        cg2Facade-&gt;Draw(0);
    }
    cg2Facade-&gt;EndInstances();
}
</code></pre>
<div>
The only things still missing from CoreGraphics2 are dynamic resources and a plugin system to extend the functionality of the render-thread side with custom code (for instance for non-essential stuff like runtime resource baking).</div>
<div>
<br /></div>
<div>
As much as I'd love to have a rendering system where dynamic resources aren't needed at all, there's no way around them yet. We still need them for particle systems and UI rendering.</div>
<div>
<br /></div>
<div>
On the front-end of the render layer, there's the new <b>Graphics2</b> subsystem. The changes are not as radical as in CoreGraphics2 (with good reason, because changes in this subsystem would affect a lot of high-level gameplay code). There are still the basic object types <b>Stage</b>, <b>View</b>, <b>Camera</b>, <b>Light</b> and <b>Model</b>. There's now a new <b>GraphicsFacade</b> object, which drastically simplifies setup and manipulation of the graphics world. And I tried out a new component system for GraphicsEntities (Models, Lights and Cameras). Instead of an inheritance hierarchy for the various GraphicsEntity types, there's now only one GraphicsEntity class which owns a set of Component objects. The combination of those components is what turns a GraphicsEntity into a visible 3D model, a light source, or a camera. The main driver behind this was that 90% of all data in a ModelEntity was character-related, but less than 10% of the graphics objects in a typical graphics world are actually characters.</div>
<div>
<br /></div>
<div>
I've split the existing functionality into the following entity components:</div>
<div>
<ul>
<li><b>TransformComponent:</b> defines the entity's position and bounding box volume in world space.</li>
<li><b>TimingComponent:</b> keeps track of the entity-local time</li>
<li><b>VisibilityComponent:</b> attached the entity to the Visibility subsystem (view frustum culling)</li>
<li><b>ModelComponent:</b> renders the entity as a simple 3D object</li>
<li><b>CharacterComponent:</b> additional functionality for skinned characters (animations, skins, joint attachments, ...)</li>
<li><b>LightComponent:</b> turns the entity into a light source</li>
<li><b>CameraComponent:</b> turns the entity into a camera</li>
</ul>
<div>
This component model hasn't really been written to allow strange combinations (you might be tempted to attach a CameraComponent to a Character-entity for a first-person shooter). Theoretically something like this might even be possible, but I don't think it is a good idea. The driving force behind the component model was cleaner code and better memory usage.</div>
</div>
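A minimal sketch of that component idea (illustrative types only, not the actual Graphics2 classes): one entity type with optional, heap-allocated components, so only the few actual characters pay for the character-related data:

```cpp
#include <memory>
#include <string>

// optional building blocks of a graphics entity
struct TransformComponent { float pos[3] = { 0.0f, 0.0f, 0.0f }; };
struct ModelComponent     { std::string resource; };
struct CharacterComponent { int numJoints = 0; /* animations, skins, ... */ };

// a single entity class instead of a Model/Light/Camera class hierarchy;
// the set of attached components determines what the entity actually is
struct GraphicsEntity {
    // every entity has a transform; the rest is optional and pay-per-use
    TransformComponent transform;
    std::unique_ptr<ModelComponent> model;
    std::unique_ptr<CharacterComponent> character;

    bool IsCharacter() const { return character != nullptr; }
};
```

A plain 3D object allocates only its ModelComponent; the 90% of per-entity character data mentioned above is simply absent for non-characters.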
<div>
<br /></div>
Unknownnoreply@blogger.comtag:blogger.com,1999:blog-2948438400037317662.post-50895454845502239542012-10-23T21:00:00.002+01:002012-10-24T16:28:40.379+01:00Mea CulpaOk, let me just say that I went from "Saulus to Paulus" (as we say in Germany) in the past few days. In my ongoing stealth mission to evaluate all the C++-to-Web technologies currently available (Google Native Client, Adobe's Flash C/C++ compiler, and Mozilla's emscripten) I actually wanted to pick Adobe's solution next, since I didn't really believe that emscripten's approach of compiling C++ to Javascript could possibly work. I had a fixed idea in my mind of how fast a C++-to-bytecode VM solution would be (that's what Adobe is doing), and how fast Javascript could possibly be, and Javascript would lose by a long shot. No way did I think it possible to run really math-heavy code in a language with such a shitty type system (excuse my French).<br />
<div>
<br /></div>
<div>
There are numbers flying around like 25% to 50% of native performance (even up to 80% for Adobe's solution) which I thought to be extremely optimistic even for hand-picked benchmarks. For instance if you look at Epic's famous Unreal Flash demo, there's not a lot of dynamic stuff happening in the 3D world you're moving through. Sure it looks impressive, but it mainly demonstrates a good art style and how fast your GPU is, but doesn't say much about how efficient the "CPU code" is running in the Adobe VM.<br />
<div>
<br /></div>
<div>
Then I started to look closer at emscripten, spent a few days with porting, and imagine my surprise when I first started the optimized version of this:</div>
<div>
<br /></div>
<div>
<a href="http://n3emscripten.appspot.com/" target="_blank">http://n3emscripten.appspot.com</a> (disclaimer: uptodate Firefox or Chrome recommended, no IE)</div>
<div>
<br /></div>
<div>
...and I added dragons and more dragons, and even more dragons, until the frame rate finally started to drop. Of course it's not Native Client performance, but it is much (much!) better than I expected.</div>
<div>
<br /></div>
<div>
Let me explain what you're seeing: </div>
<div>
<br /></div>
<div>
The demo is built from a set of Nebula3 modules consisting of about 120k lines of C++ code cross-compiled to Javascript+WebGL through the emscripten compiler infrastructure. There is a lot (really a LOT) of vector floating point math C++ code running in the animation engine because I must admit that I actually wanted to "break" the JS engine and show how incredibly much faster NaCl would be. Well, that didn't quite work out ;)</div>
<div>
<br /></div>
<div>
Of those 120k lines of code, only a few hundred lines are actually specific to the emscripten platform. So there's less than 0.5% of platform-specific code, and about 99.5% of the code is exactly the same as in the NaCl demo, or in an actual "native" (OpenGL-based) desktop version of the demo. If you take all of Nebula3 (about half a million lines of C++ code), then the ratio of platform-specific code is even more impressive.</div>
<div>
<br /></div>
<div>
Let this sink in for a while: </div>
<div>
<br /></div>
<div>
You can take a really big C++ code base with dozens of man-years of engineering effort and a mature asset pipeline attached to it, spend about 2 weekends(!) of tinkering, and run your code at a really good performance level in a browser without plugins! You still have to be realistic about CPU performance of course. It helps if a game is designed to make relatively little use of the CPU and to move as much work as possible onto the GPU, but these are the normal realities of game development. Of all the target platforms for a project you should choose the weakest as the "lead platform", make the game run well there, and use the extra power of the other platforms for non-essential but pretty "bells'n'whistles".</div>
<div>
<br /></div>
<div>
And you don't have to burn bridges behind you: you can still use the exact same code base and asset pipeline to create traditional native applications for mobile platforms, desktop apps, or game consoles. And all of this in a programming language and a graphics API which many considered almost on their death beds a few years ago.</div>
<div>
<br /></div>
<div>
I'd say that C++ and OpenGL are on a goddamn come-back tour right now :)</div>
</div>
<div>
<br /></div>
<div>
Not all is golden though, each of the C++-to-web solutions has at least one serious weakness:</div>
<div>
<ul>
<li>Native Client: extremely fast and feature rich, but only supported by Google</li>
<li>emscripten (more specifically WebGL): not supported by Microsoft</li>
<li>Adobe flascc: Adobe wants a "speed tax" if you're using certain "premium features" required for high-performance 3D games and earn money with it</li>
</ul>
<div>
Sucks badly, since most of this is purely politics-driven, not for technological reasons.</div>
</div>
<div>
<br /></div>
<div>
So... originally I wanted this blog post to be a technical post-mortem of the emscripten port, but on one hand it's already quite long, and on the other hand: there's really not much to write about, since it went so smooth.</div>
<div>
<br /></div>
<div>
I installed the SDK and a few required tools to my MacBook, wrote (yet another) simple CMake toolchain file, and was able to compile most of Nebula3 out of the box within an hour. The most exciting event was that I found a minor bug in the OpenGL wrapper code, which was fixed within the day by the emscripten team (kudos and many thanks to kripken (aka azakai) for being so incredibly fast and helpful).</div>
<div>
<br /></div>
<div>
The only area where I had to spend a bit more time was on threading (or rather the lack thereof). Since emscripten (like NaCl) is running in the browser context, it suffers from many of the same limitations - like not being able to do synchronous IO, and you cannot "own the game loop" but need to run in little per-frame time-slices inside callbacks from the browser. NaCl offers a working pthreads API to work around these limitations, but emscripten cannot support true threading since the underlying Javascript runtime doesn't allow sharing state between threads. This looked like a real show-stopper to me in the beginning, but after a few nights of sleeping on this problem I found a really simple solution by moving up to a higher level, into Nebula3's asynchronous messaging system. Up there it was relatively easy to replace the multithreaded message handler code with a frame-callback system with minimal code changes.</div>
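The frame-callback replacement can be sketched like this (made-up names, not the actual Nebula3 messaging code): a message port that, instead of being drained by a dedicated handler thread, is pumped once per frame from the browser's animation callback on the main thread:

```cpp
#include <functional>
#include <queue>
#include <string>

// single-threaded stand-in for an asynchronous message port: messages are
// queued by the caller and handled in batches, once per frame
class MessagePort {
public:
    void Send(std::string msg) { pending.push(std::move(msg)); }

    // called once per frame from the browser's per-frame callback;
    // this replaces the render-thread's message handler loop
    void OnFrame(const std::function<void(const std::string&)>& handler) {
        while (!pending.empty()) {
            handler(pending.front());
            pending.pop();
        }
    }

private:
    std::queue<std::string> pending;
};
```

Because the messaging API was already asynchronous, callers can't tell whether their messages are handled by another thread or by the next frame callback, which is why the change stayed so small.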
<div>
<br /></div>
<div>
It sucks a bit right now that everything runs on the main thread (so the current N3 demo cannot take advantage of multiple CPU cores), but a solution for a higher level multithreaded API based on HTML5 webworkers is in the works right now (I really can't believe how fast these guys are!).</div>
<div>
<br /></div>
<div>
I'm tempted to write a lot more about how incredibly clever the C++-to-JS cross-compilation is, and how the generated JS code can be even faster than handwritten code, and how surprisingly compact the generated code is, but if you're getting all excited about this type of stuff it's better if you read it first hand: </div>
<div>
<br /></div>
<div>
<a href="https://github.com/kripken/emscripten/wiki" target="_blank">https://github.com/kripken/emscripten/wiki</a></div>
<div>
<br /></div>
<div>
So where next? I'll polish the emscripten port a bit more and implement support for the new web worker API to load and decompress asset files in the background, and then take a little detour and port the Dragons demo to Adobe's flascc (perfect timing, since they just went into open beta). After that I need to do some cleanup work on all three ports, and on the higher-level parts of the render pipeline, since the low-level rendering code has been replaced with the new "Twiggy" stuff.</div>
<div>
<br /></div>
<div>
In the meantime, here's another exciting development: RakNet has started supporting Google Native Client: <a href="http://www.jenkinssoftware.com/forum/index.php?topic=4980.0" target="_blank">http://www.jenkinssoftware.com/forum/index.php?topic=4980.0</a>.</div>
<div>
<br /></div>
<div>
All the pieces are slowly falling into place...</div>
<div>
<br /></div>
Unknownnoreply@blogger.com