4 May 2013

Minor demos and web page update

Couple of minor changes at http://www.flohofwoe.net:
  • I have removed the non-asm.js demos. Since the asm.js code generation in emscripten is now always faster than the "traditional" code generation, it doesn't make sense to keep the non-asm.js code around. I'll keep support for the old code generation in my build pipeline for now, though, so that I can compare the new and old code from time to time.
  • The demos are now compiled with link-time-optimization enabled. Previously this had subtle and hard-to-debug code generation problems, but it looks like this is fixed now (fingers crossed). Performance and code size don't seem to differ much, however.
  • Demos have been recompiled with the latest emscripten incoming branch.
  • I added experimental support for uncompressed textures if the WebGL implementation doesn't support DXT textures (e.g. mobile platforms). This will decompress textures on the fly after download. For now this is just a workaround/hack and hasn't been tested that much. Also, since uncompressed textures are 4..8x bigger, this isn't really useful for complex games.
  • I have added a high-level source code page for people who like to read some code: http://www.flohofwoe.net/sources.html
  • Finally, http://n3emscripten.appspot.com will no longer be updated, and I've put a link to the new demos there.
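
The on-the-fly decompression mentioned in the list above can be illustrated with a minimal DXT1 block decoder. This is only a sketch of the standard DXT1 layout, not the code used in the demos; the names `Rgba8`, `expand565`, `blend` and `decodeDxt1Block` are made up for illustration. It also shows where the upper end of the "4..8x" comes from: an 8-byte DXT1 block expands to 16 RGBA8 pixels, which is 64 bytes (DXT3/DXT5 blocks are 16 bytes, hence 4x).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// One decoded pixel, 8 bits per channel.
struct Rgba8 { uint8_t r, g, b, a; };

// Expand a packed RGB565 color to 8 bits per channel (alpha = 255).
inline Rgba8 expand565(uint16_t c) {
    const unsigned r5 = (c >> 11) & 0x1Fu;
    const unsigned g6 = (c >> 5) & 0x3Fu;
    const unsigned b5 = c & 0x1Fu;
    return Rgba8{
        static_cast<uint8_t>((r5 << 3) | (r5 >> 2)),
        static_cast<uint8_t>((g6 << 2) | (g6 >> 4)),
        static_cast<uint8_t>((b5 << 3) | (b5 >> 2)),
        255};
}

// Weighted mix of two opaque colors: (wa * a + wb * b) / (wa + wb).
inline Rgba8 blend(Rgba8 a, Rgba8 b, unsigned wa, unsigned wb) {
    const unsigned sum = wa + wb;
    return Rgba8{
        static_cast<uint8_t>((a.r * wa + b.r * wb) / sum),
        static_cast<uint8_t>((a.g * wa + b.g * wb) / sum),
        static_cast<uint8_t>((a.b * wa + b.b * wb) / sum),
        255};
}

// Decode one 8-byte DXT1 block into a 4x4 tile of RGBA8 pixels (row-major).
inline std::array<Rgba8, 16> decodeDxt1Block(const uint8_t* block) {
    assert(block != nullptr);
    // Two little-endian RGB565 endpoint colors.
    const uint16_t c0 = static_cast<uint16_t>(block[0] | (block[1] << 8));
    const uint16_t c1 = static_cast<uint16_t>(block[2] | (block[3] << 8));
    Rgba8 palette[4];
    palette[0] = expand565(c0);
    palette[1] = expand565(c1);
    if (c0 > c1) {
        // Four-color mode: two interpolated colors between the endpoints.
        palette[2] = blend(palette[0], palette[1], 2, 1);
        palette[3] = blend(palette[0], palette[1], 1, 2);
    } else {
        // Three-color mode: midpoint plus transparent black.
        palette[2] = blend(palette[0], palette[1], 1, 1);
        palette[3] = Rgba8{0, 0, 0, 0};
    }
    std::array<Rgba8, 16> out{};
    for (int row = 0; row < 4; ++row) {
        const unsigned bits = block[4 + row];  // four 2-bit indices, LSB first
        for (int col = 0; col < 4; ++col) {
            out[static_cast<std::size_t>(row * 4 + col)] =
                palette[(bits >> (2 * col)) & 3u];
        }
    }
    return out;
}
```

A full fallback path would run this over every block of every mip level and would also need DXT3/DXT5 alpha handling, which is not shown here.
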
-Floh.

25 Apr 2013

Quo Vadis Talk, New Demo Place

Quick update:

Just came back from Quo Vadis 2013 in Berlin, where I talked about "C++ on the Web" in front of a crowded room (thanks to all who were there :). The slides are here:

http://de.slideshare.net/andreweissflog3/quovadis2013-cpp-ontheweb

And I have moved the Nebula3/emscripten demos to my own web site here:

http://www.flohofwoe.net/demos.html

The demos at the old appspot.com URL haven't been updated in a while. When I get around to it I'll redirect to the new demo page from there.

Over and out :)
-Floh.

22 Mar 2013

Why I spend my precious spare time with emscripten

I recently realized that I have spent much more time with emscripten than with any other "weekend project" so far. At least, the emscripten-based demos have become the most advanced of anything I built on my spare-time coding platforms of the past 2 years (iOS, Android, Google Native Client, flascc).

I think it comes down to "open, free and painless", for spare-time projects these are all extremely important points. I want to spend my free time with stuff that is fun.

Let's look at why the other stuff isn't as much fun:

iOS: The tools you need for development are all free, XCode is a very slick IDE to work in, and unlike VisualStudio there's no artificial distinction between a (feature-cut) free and a (pricy) professional version. So far so good. The pain starts when you want to run your code on your actual iOS device. Welcome to provisioning profile hell. First you need to hand over $99 per year for the privilege to run your own code on your own hardware, but that's the least of it. Next you need to create "provisioning profiles" on Apple's developer portal, registering each team member, device and application, and set up who may do what. In the end you essentially get per-app/per-device code-signing-certificates which expire every three months. So all the iOS demos which I did 2 years ago don't work anymore unless I go through all that hell again. Nope.

Android: Android C++ development sucks, plain and simple. It's a pain in the ass to set up (it's less painful if you use nVidia's ready-made installer), remote debugging a native app is so slow it's essentially useless, and you can't use the cool new stuff since most of the world is still running an Android version from the stone age. To be fair, this was all 1.5 years ago, but I have little motivation to waste further weekends on finding out whether things have improved since then ;)

Google Native Client: The main reasons why I stopped dabbling with Native Client are that it is still not opened up (it only works with Chrome Web Store bundled apps), and that pNaCl seems to take forever to be finished. To be fair, Native Client has very good middleware support (like FMOD or RakNet), but it doesn't look like it will ever be implemented outside of Chrome.

flascc: I played around with flascc for a weekend or two. There are 2 main reasons why it didn't set my heart on fire: (1) Compiling/linking is extraordinarily slow AND/OR uses infinite amounts of RAM. For reasonably big code bases (like Nebula3) it's unusable because my 4GB Mac simply ran out of memory. (2) Since working with flascc is so damn slow, I wasn't motivated to actually go on with writing a Stage3D wrapper for N3's rendering layer.

So all in all, emscripten is the most frictionless way to write and actually publish 3D demos for me. I can host the demos wherever I want, update them without a certification or signing process getting in the way, the demos won't expire, they are automatically multi-platform and finally, there's no vendor or platform lock-in. Most of the code I'm writing is platform-agnostic C++ and will compile and run anywhere, and the host platform's "API footprint" is minimal: a subset of POSIX and OpenGL, which will also compile almost anywhere else with minimal changes.

18 Mar 2013

Updated Nebula3/emscripten Demos

Update 3: I replaced SQLite with a TableData addon, this reduces the map-viewer-demo size from 8 MB down to 5 MB (uncompressed), and reduces startup time dramatically.

Update 2: Demos should now work properly again on all WebGL configs (those which support DXT textures, to be exact). I had been using more than 254 vertex shader uniforms, and at least ANGLE restricts this number even if the GPU could actually handle a lot more.

Update: Demos don't work on Windows and some other configs since one of the new GLSL shaders doesn't compile. Tested configs are: OSX 10.7.5 with GeForce 9400M, Intel HD3000, HD4000 and Radeon HD 6770M. Fix is coming later today.

Finally a new demo update! If you're a Chrome user, please be aware that you need to run these demos in the very latest Chrome Canary (Version 27.0.1444.3 canary) since this contains a bugfix in the V8 Javascript engine (details are here: https://code.google.com/p/chromium/issues/detail?id=177883). This bug was also the reason why I held back updates for so long: I couldn't overwrite the version which reproduces this bug, but I also didn't feel like setting up yet another AppEngine project.

Updated demos are here: http://n3emscripten.appspot.com

The DSO map viewer demo is now much closer to the actual map renderer of the Drakensang Online client:


The ground-decals system has been moved over which helps a lot in hiding the tiling structure of the level. The rendering pipeline now includes posteffects like bloom and color-balancing. You're now controlling a "player character", and I added a few more "NPCs" to the map in order to check performance with a couple of characters on screen.

All demos now come in 2 flavours: "regular" and "asm.js". 

ASM.JS is a Mozilla project to define a small subset of Javascript which can be exceptionally well optimized. More about that here: http://asmjs.org/

I also identified the cause of the long pause at the start of the map viewer demo. Originally I thought this would be caused by generating the collision mesh, which is built at startup from tens of thousands of very small mesh fragments, but surprisingly this is extremely fast. The pause is actually caused by parsing the structure of an SQLite database file and reading many small items from the database. Replacing this with a more efficient "table data" subsystem is the next thing on my weekend todo list. The SQLite stuff is really a left-over from the single-player Drakensangs, where the world state was loaded from and written back to SQLite database files.

That's it for today!

10 Feb 2013

Diminishing Returns

Weekend was kinda semi-successful as far as coding is concerned. I tried various ways to reduce GL calls further, and was able to reduce the number of GL calls by about 25%: from about 4100 down to about 3000 in the initial screen of the Drakensang Online map viewer demo. Although this sounds pretty good, I'm a bit disappointed because I was hoping that bundling vertex data chunks into big vertex buffers would have a bigger effect:

- Bundling vertex data into big vertex buffers cut the number of glVertexAttribPointer() calls by almost half from about 950 down to about 500. With the GL_vertex_array_object extension however, I could save double the GL calls for "free" (so the demo would be down to 3100 GL calls without any additional optimizations), and the savings would be more consistent (right now it depends a lot on the order of draw calls). The bundling added *a lot* of complex code, so it's probably not really worth it, since at least Chrome already supports OES_vertex_array_object in WebGL, so it would make more sense to support that.

- All the rest was gained by simply filtering redundant texture updates (glActiveTexture, glBindTexture, glUniform1i). This was a big win for very little code, but this also varies with the actual textures applied to the objects. Fewer shared textures means more updates.
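
The redundant texture update filter from the second point can be sketched as a small shadow-state cache. This is a hypothetical illustration, not the Nebula3 code: `TextureBindFilter`, `kMaxUnits` and the counters are assumptions, and the actual GL calls are only indicated in a comment so that the sketch runs without a GL context.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical filter that remembers which texture is bound to each unit
// and reports whether a bind request really has to reach GL.
class TextureBindFilter {
public:
    static constexpr int kMaxUnits = 8;  // assumed number of texture units

    // Returns true if the bind must be issued, false if it is redundant.
    bool bind(int unit, uint32_t texture) {
        assert(unit >= 0 && unit < kMaxUnits);
        uint32_t& current = bound_[static_cast<std::size_t>(unit)];
        if (current == texture) {
            ++skipped_;
            return false;
        }
        current = texture;
        ++issued_;
        // A real renderer would now call glActiveTexture(GL_TEXTURE0 + unit)
        // and glBindTexture(GL_TEXTURE_2D, texture).
        return true;
    }

    int issued() const { return issued_; }
    int skipped() const { return skipped_; }

private:
    std::array<uint32_t, kMaxUnits> bound_{};  // 0 = default texture bound
    int issued_ = 0;
    int skipped_ = 0;
};
```

Binding the same texture to the same unit twice in a row only costs one real bind. How much this saves depends on how many consecutive objects share textures, which matches the observation that fewer shared textures means more updates.
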

I also tried to generally filter redundant shader uniform updates, but with little effect. Apart from the texture updates, an entire frame had fewer than 10 redundant uniform updates, so not worth it.

I'll give the GL call optimization a little rest for now and concentrate on adding features. There's still some untapped potential in grouping transform matrix updates into arrays, and by better sorting inside batches. But right now I've had enough ;)

23 Jan 2013

A Radeon Fix and More

The Nebula3/emscripten demos (http://n3emscripten.appspot.com) had a serious performance problem on Macs with Radeon GPUs in the instancing demos. The problem was that my pseudo-instancing code used an additional vertex buffer with 1-dimensional ubyte vertex components as fake InstanceIds. This worked fine on nVidia and Intel GPUs, but triggered a horrible slow path in the OSX Radeon driver. After replacing this with ubyte4 components everything worked fine on Radeons, but I wasn't happy that the InstanceId buffer would now be 4 times as large, with 3/4 of the size dead weight. Then today, on the train from Hamburg back to Berlin, the embarrassingly obvious solution occurred to me: stash the InstanceId in the unused w-component of the vertex normals. These are in packed ubyte4 format, with the last byte unused. With this simple fix I could get rid of the second vertex buffer completely and actually throw away most of the pseudo-instancing code. Win-win!
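
A minimal sketch of the idea of putting the InstanceId into the spare byte of a packed normal. The exact normal encoding used in Nebula3 isn't described here, so the biased [-1, 1] to [0, 255] mapping is an assumption, as are the names `PackedNormal`, `packSnorm` and `packNormalWithInstanceId`.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Four bytes per normal: x, y, z plus a formerly unused fourth byte.
struct PackedNormal { uint8_t x, y, z, w; };

// Assumed encoding: map a component from [-1, 1] to [0, 255].
inline uint8_t packSnorm(float v) {
    assert(v >= -1.0f && v <= 1.0f);
    return static_cast<uint8_t>(std::lround((v * 0.5f + 0.5f) * 255.0f));
}

// Store the instance id in the w byte instead of a second vertex buffer.
inline PackedNormal packNormalWithInstanceId(float nx, float ny, float nz,
                                             uint8_t instanceId) {
    return PackedNormal{packSnorm(nx), packSnorm(ny), packSnorm(nz), instanceId};
}
```

The vertex stays at the same size as before, and the vertex shader can read the id from the normal attribute's w-component. Since the id is a single byte, one batch built this way can address at most 256 instances.
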

And now on to the actual issue: I didn't really pay attention to the code path which is used if the GL vertex array object extension isn't available, and I was shocked when I discovered that the dsomapviewer demo performs 7000 GL calls per frame (not draw calls, but all types of GL calls). Then I was astonished that Javascript+WebGL crunches through those 7k calls without a problem even on my puny laptop. But something had to be done about that of course.

OpenGL / WebGL without extensions is very verbose even compared to Direct3D9. To prepare the geometry for rendering, you need to bind a vertex buffer (or several), bind an index buffer, and for each vertex component call glEnableVertexAttribArray() and glVertexAttribPointer(), aaaand each unused vertex attribute must be disabled with glDisableVertexAttribArray(). Depending on the max number of vertex attributes supported in the engine, this can add up to dozens of calls just to switch geometry. And whenever a different vertex buffer is bound, at least the glVertexAttribPointer() functions must be called again, and if the vertex specification has changed, vertex attribute arrays must be enabled or disabled accordingly.
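
To put a rough number on "dozens of calls", here is a small worked count of the unfiltered sequence described above. The function name and the example figures (16 attribute slots, 4 of them used) are assumptions for illustration only.

```cpp
#include <cassert>

// Counts the GL calls for one unfiltered geometry switch without vertex
// array objects: buffer binds, per-attribute setup, and disabling the rest.
inline int geometrySwitchCalls(int vertexBuffers, int usedAttribs, int maxAttribs) {
    assert(vertexBuffers >= 1);
    assert(usedAttribs >= 0 && usedAttribs <= maxAttribs);
    // glBindBuffer() for each vertex buffer plus one for the index buffer.
    const int bufferBinds = vertexBuffers + 1;
    // glEnableVertexAttribArray() + glVertexAttribPointer() per used attribute.
    const int attribSetup = usedAttribs * 2;
    // glDisableVertexAttribArray() for every unused attribute slot.
    const int attribDisables = maxAttribs - usedAttribs;
    return bufferBinds + attribSetup + attribDisables;
}
```

With one vertex buffer, 4 used attributes and 16 slots this gives 2 + 8 + 12 = 22 calls per geometry switch, so a few hundred switches per frame already account for thousands of calls.
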

With the vertex array object extension all of this can be combined into a single call.

This particular part of defining the vertex layout is by far the least elegant area of the OpenGL spec, and even the vertex array object stuff could be nicer. To me it doesn't make a lot of sense to include the buffer binding in the vertex attribute state, keeping the buffer separate from the vertex layout would make more sense IMHO. But enough with the ranting.

Other high-frequency calls are the glUniformXXX() functions to update shader variables, and the whole process of assigning textures to shaders. Un-extended WebGL doesn't provide functions to bundle these static shader updates into some sort of buffers.

These types of high-frequency calls are exactly what we don't want in Javascript and WebGL. In a native OpenGL app, these calls are usually extremely cheap, so it doesn't matter that much. But when calling a WebGL function from emscripten, there's quite a lot of overhead (at least compared to a native GL app). First, emscripten maintains some lookup tables to associate numeric GL ids with Javascript objects. Then the WebGL JS functions are called. In Chrome, these calls are serialized into a command buffer which is transferred to another process; in this GPU process the commands are unpacked and validated, and the actual GL function is called. But it doesn't end there. On Windows, the ANGLE wrapper translates the OpenGL calls to Direct3D9 calls. So what's an extremely cheap GL call in a native app comes with some serious overhead in a WebGL app. Considering all this it is really mind-blowing that WebGL is still so fast!

All this means though, that it really makes a lot of sense to filter redundant GL calls, especially in a WebGL application, and every GL extension which helps to reduce the number of API calls is many times more valuable under WebGL!

So my mission in the train from Berlin to Hamburg and back today was to filter out those redundant GL calls.

First I wanted to know what calls are actually the problem. The OSX OpenGL Profiler tool can help with this. It records a trace of all OpenGL calls, can create a quick stat of the most-called functions, and the sequence of calls with their arguments reveals which calls suffer most from redundancy.

In the dsomapviewer demo these are: glEnableVertexAttribArray(), glDisableVertexAttribArray(), glBindBuffer() and glUseProgram().

Apart from filtering those lowlevel calls I also implemented a separate high-level filter which skips complete mesh assignment operations (that whole call sequence of buffer bindings and vertex attribute specification I talked about before).
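
The two filter levels can be sketched together as follows. This is a hypothetical illustration rather than the actual Nebula3 implementation: the class name, the bitmask representation of the enabled attribute arrays and the use of 0 as the "no mesh" id are all assumptions, and the GL calls are only indicated in comments.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical two-level filter: skip a whole mesh assignment if the mesh is
// already current, otherwise toggle only the attribute arrays that changed.
class VertexAttribFilter {
public:
    static constexpr int kMaxAttribs = 16;  // assumed engine-wide limit

    // wantedMask has bit i set if attribute slot i is used by the mesh.
    // Returns the number of enable/disable calls that would be issued.
    int apply(uint32_t meshId, uint16_t wantedMask) {
        assert(meshId != 0);  // 0 is reserved for "no mesh" in this sketch
        if (meshId == currentMesh_) {
            return 0;  // high-level filter: nothing to do at all
        }
        currentMesh_ = meshId;
        // A real renderer would bind the buffers and call
        // glVertexAttribPointer() for the used attributes here.
        const uint16_t changed = static_cast<uint16_t>(enabledMask_ ^ wantedMask);
        int calls = 0;
        for (int slot = 0; slot < kMaxAttribs; ++slot) {
            if (changed & (1u << slot)) {
                // Bit set in wantedMask: glEnableVertexAttribArray(slot),
                // otherwise glDisableVertexAttribArray(slot).
                ++calls;
            }
        }
        enabledMask_ = wantedMask;
        return calls;
    }

private:
    uint32_t currentMesh_ = 0;
    uint16_t enabledMask_ = 0;
};
```

Switching between two meshes with the same vertex layout then costs no enable/disable calls at all, and drawing the same mesh repeatedly skips the entire setup sequence.
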

All in all the results were encouraging: per-frame GL calls dropped from 7k down to 4k. In comparison: when using the vertex array object extension the number of GL calls goes down to about 3k.

This could be improved even more by reducing the number of vertex buffers, and bundling the vertex data of many graphics objects into one or few big vertex buffers, since then much fewer buffer binds and vertex attribute specification calls would be needed (at least if they occur in the right sequence). But for this I would either need the glDrawElementsBaseVertex() function, which is not available in WebGL, or I would need to fix-up lots of indices whenever vertex data is created or destroyed (but this would limit the size of one compound vertex buffer to 64k vertices, and limit the efficiency of the bundling, hmm...).
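
The index fix-up variant can be sketched like this, assuming 16-bit indices and a position-only vertex layout; `CompoundMesh` and `appendMesh` are hypothetical names. Each appended mesh gets its indices shifted by the number of vertices already in the compound buffer, which is what glDrawElementsBaseVertex() would otherwise do at draw time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical compound buffer holding the vertex data of many meshes.
struct CompoundMesh {
    std::vector<float> vertices;    // assumed layout: 3 floats per vertex
    std::vector<uint16_t> indices;  // 16-bit indices into 'vertices'
    std::size_t vertexCount() const { return vertices.size() / 3; }
};

// Appends a mesh and rebases its indices. Returns false and changes nothing
// if the result would exceed what 16-bit indices can address.
inline bool appendMesh(CompoundMesh& dst,
                       const std::vector<float>& verts,
                       const std::vector<uint16_t>& idx) {
    assert(verts.size() % 3 == 0);
    const std::size_t base = dst.vertexCount();
    const std::size_t added = verts.size() / 3;
    if (base + added > 65536) {
        return false;  // the 64k vertex limit of a compound buffer
    }
    dst.vertices.insert(dst.vertices.end(), verts.begin(), verts.end());
    for (uint16_t i : idx) {
        assert(i < added);  // indices must refer to the mesh's own vertices
        dst.indices.push_back(static_cast<uint16_t>(base + i));
    }
    return true;
}
```

Appending is the easy half. Removing a mesh from the middle of the buffer would mean rebasing the indices of everything stored after it, which is part of what makes this approach unattractive.
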

Anyway, to wrap this up, Chrome already exposes the OES_vertex_array_object extension, and an ANGLE_instanced_arrays extension seems to be on the way. Both should help a lot to reduce GL calls already. Then the only remaining problem is texture assignment and uniform updates in scenes with many different materials.

But I think before working on reducing GL calls even more I'll try to do something about the stuttering when new graphics assets are streamed in.

Over & Out,
-Floh.

19 Jan 2013

A Drakensang Online map viewer in emscripten

Update 2: The OSX/Radeon performance problem should be fixed now. See here: http://flohofwoe.blogspot.de/2013/01/a-radeon-fix-and-more.html

Update: Just found out that the demo runs incredibly slow on a 15"Mac when running on the discrete AMD Radeon HD 6770M chip (it's actually much faster on the integrated Intel HD 3000). This is both on Chrome and Firefox, reason unknown yet. So if you have one of these, note that the demo runs actually a lot smoother ;)

I did a very simple proof-of-concept Drakensang Online map viewer in Nebula3/emscripten (as always, Chrome or Firefox required), to see how JS+WebGL can deal with a close-to-real-world 3D scenario:

Drakensang Online map viewer
This is work in progress and I will spend more time with optimizations before moving on to the next demo.

You'll notice that there's still frame-rate stuttering when moving around the map (with left-mouse-button + dragging). The bad type of stuttering is caused by asset loading, which happens on demand when new graphics objects are pulled in as they enter the view volume. I don't know yet what causes the lighter stuttering when moving around in areas which are completely loaded. I need to do a detailed profiling session to figure out what's going on there exactly. The stuttering also happens (to a lesser extent) in the native OSX version of the demo. It's most likely the preparation and creation of OpenGL resources, like vertex buffers, index buffers and textures. I will need to figure out how to move more of the asset creation stuff out of the main thread.

The demo is also quite demanding on WebGL. Despite the pseudo-instancing which I implemented recently there are still a lot of OpenGL calls per frame. Support for OES_vertex_array_object (Chrome already exposes this) and something like ARB_instanced_arrays would help a lot to reduce the number of GL calls drastically (the JS profiler currently shows the vertex array definition as the most expensive rendering-related code, followed by the matrix array uniform updates for the pseudo-instancing code).

Finally I've added a new Nebula3 code module to this demo: the ODE-based physics and collision subsystem is now also running in emscripten (no changes were necessary). The demo sets up a static collide world at startup and uses this to perform stabbing checks under the mouse pointer. Unfortunately adding ODE almost doubled the size of the generated Javascript code. This is another incentive to finally get rid of our (somewhat bloated) physics wrapper code and ODE, and build a new slim collision system, probably on top of the Bullet collision classes (we're mainly using the current physics wrapper for simple collision checks on a static collide world in the live version of Drakensang Online, so not much of value will be lost).

Also, originally I wanted to include SQLite into the demo, since additional map info is currently stored in an additional SQLite file (lighting information, player start position, etc...). But this didn't work out of the box because SQLite's file i/o code must be adapted.

This wouldn't be hard to fix, but I have actually wanted to get rid of SQLite for a long time. SQLite was really useful as a save-game system in the single player Drakensang games, but if you don't need to save game world changes back, a complete SQL implementation in the client is just overkill. So this is another good reason to finally get started with a nice and small TableData subsystem in Nebula3.

The frame-stuttering is a tiny bit disheartening, but on the other hand this is to be expected when bringing a complex code base over to a new platform. Most important right now is to really know what's going on, so I will probably spend some time adding profiling code and do some performance analysis next - together with text rendering to get some continuous debug statistics output on screen.

Exciting stuff :D