21 Oct 2012

Finished N3/emscripten Demo

Here's a more complete demo of Nebula3 running in Javascript+WebGL via emscripten:

http://n3emscripten.appspot.com/

You can move/rotate the camera with mouse-button + dragging, and add more dragons with the Up key.

Each dragon evaluates its own animation on the CPU. Despite that, I think the resulting performance is pretty good.

You can compare the performance to the Native Client version here (requires Chrome):

https://chrome.google.com/webstore/search/nebula3

Have fun,
-Floh.

11 comments:

eRiC Werner said...

Cool! So this is only JS? :O whow!

I tried increasing the dragon count 10 times in each version: here in FF and Chrome you can feel it dragging on the hardware, whereas on NaCl nothing feels slow yet, even with 441 dragons.

Pretty neat anyway! Could you add an fps counter please?

kripken said...

Not surprising the NaCl version will be faster under heavy load. There is always going to be a tradeoff between portability and performance, and NaCl's is very different than the web's.

There is, however, plenty of room for improvement in the JS version's speed. Over the next year I expect to see significant improvements both in compilers to JS and in JS engines' optimizations for such code.

Floh said...

Performance and robustness are definitely good enough for big codebases and "serious stuff" already, and there's always the option to offload more work to the GPU. Link time for release mode could be better (it already takes more than 10 minutes for this simple demo).

I'm surprised about the resulting code size. The current demo "exe" is about 2.1 MB, which zip-compresses down to about 450 kByte (but it's not compressed yet). The NaCl executables are over 4 MB each (stripped and uncompressed), and there are currently 2 required, one 32-bit and one 64-bit. I guess NaCl links a lot more stuff into the exe, while emscripten just has relatively thin wrappers around Javascript / HTML 5 APIs (just a guess).

IMHO there are currently 2 main issues. First, there are irregular "spikes" during execution: sometimes there are slow frames until the JS engine "catches up" and suddenly runs smoothly again. Maybe this is garbage collection, or the hotspot optimizer at work, I don't know, but it's definitely a problem for 3D games. The other problem is that it is currently impossible, or at least difficult, to run stuff outside of the main thread since pthreads are not supported. It's difficult to stream and uncompress data in the background, since the decompression happens on the main loop right now. I hope we'll get some access to web workers through a C API, even if it's not pthreads-compatible, to offload stuff like this from the main loop.

I'll write a more comprehensive blog post next week or so about the porting experience. It definitely went smoother than expected.

kripken said...

Compile time is indeed a disadvantage with emscripten. It's built more for hackability than performance, so compared to other compilers we generate better code (it was easy to add additional js-specific optimization passes in emscripten) but it takes longer.

I was also a little surprised by the small generated code size. How big is Nebula3 in lines of code, btw? (Not surprised JS is smaller than NaCl, both for the reasons you mentioned, and also NaCl's sandboxing increases code size.)

Where do you see spikes and/or slow frames? I'm on dev versions of Firefox and Chrome and this demo runs smoothly in both. If you are testing on older versions of those browsers it might be bugs that have been fixed meanwhile. (There is some stutter during the first few seconds though, that's true and expected at least for now, might need workarounds for those if they are an issue.)

I will definitely add some web worker C API to emscripten. The big question though is how it should look, feedback is welcome. Simplest could be

call_worker(func_name, data, data_len, callback);

where data is a pointer and data_len a size, and we copy over that block of data, and callback is called when there is a response. Would this be enough, or do we need to think about something more complex?
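To make the proposed semantics concrete, here is a minimal C++ mock of the call above. It is not emscripten's API: `call_worker` only mirrors the signature proposed in this comment, `register_worker_func`, `WorkerFunc` and `WorkerCallback` are hypothetical names made up for the sketch, and the "worker" runs synchronously on the calling thread. It only illustrates the contract: the input block is copied, the function works on its private copy, and the callback receives the response.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types: a worker function gets a private copy of the input
// block and returns a response block.
using Bytes = std::vector<unsigned char>;
using WorkerFunc = std::function<Bytes(const Bytes&)>;
using WorkerCallback =
    std::function<void(const unsigned char* data, std::size_t len)>;

// Registry standing in for the functions a real worker would expose.
inline std::map<std::string, WorkerFunc>& worker_registry() {
    static std::map<std::string, WorkerFunc> registry;
    return registry;
}

inline void register_worker_func(const std::string& name, WorkerFunc func) {
    worker_registry()[name] = std::move(func);
}

// Mock of the proposed call. A real implementation would post the copied
// block to a web worker and invoke the callback later from the main loop;
// this mock runs everything immediately on the calling thread.
inline bool call_worker(const std::string& func_name,
                        const void* data, std::size_t data_len,
                        const WorkerCallback& callback) {
    assert(data != nullptr || data_len == 0);
    auto it = worker_registry().find(func_name);
    if (it == worker_registry().end()) {
        return false;  // unknown function name
    }
    const unsigned char* bytes = static_cast<const unsigned char*>(data);
    Bytes input(bytes, bytes + data_len);   // copy in: no shared memory
    Bytes response = it->second(input);     // "worker" side
    callback(response.data(), response.size());
    return true;
}
```

Because the block is copied in both directions, the caller can free or reuse `data` right after the call, which is the main attraction of this simple shape for something like background decompression.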

jamesu said...

Impressive!

Also I have to admit, I thought Nebula Device was completely dead. It's great to see it still being worked on.

Floh said...

@kripken: The engine modules compiled into the demo have about 120k lines of code; all of Nebula3 would be about half a million lines of code. I'm not sure how much is removed by dead code elimination, but this is a pretty complete representation of the new render pipeline.

I notice some stuttering right after adding more dragons on my rather weak MacBook at home. On my desktop at work it's much better though.

Fungos Bauux said...

Quick question, how do you deal with these animations? In the nacl demo it says 54 bones / 3.1k triangles per dragon.

Is the animation cached? Which extensions do you use?

Are there any numbers comparing bone animation between nacl<>js that you can show?

Also, would be nice to have a FPS counter somewhere :)

Floh said...

Animation is computed each frame for each dragon individually, so the demo is actually quite heavy on the CPU, mainly because I wanted to see how the emscripten generated code behaves with heavy math code. Each joint is fed by 3 animation curves: translation, rotation (by quaternion), and scale. After animation evaluation, joint matrices are computed from the animation result and fed into the vertex shader. There's no caching or reuse of animation data between characters or between frames taking place. The C++ code had already been carefully optimized, and designed to play nice with CPU caches (related data lies close together in memory). Some of these manual optimizations also seem to benefit the generated JS code. There's one very important optimization I forgot about (not yet implemented in my experimental branch where the demo is built from): quaternion SLERPs are always real slerps right now (very expensive), but 99% of the time could be replaced by a simple linear interpolation (see here: http://number-none.com/product/Hacking%20Quaternions/index.html).

The number of triangles of the character model isn't really important for JS performance, since the skinning happens in the vertex shader at full GPU speed.
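As a rough illustration of the "joint matrices are computed from the animation result" step, here is a minimal C++ sketch that builds one joint matrix from a sampled translation, rotation quaternion, and scale. This is not Nebula3 code: the type and function names, the row-major layout, the column-vector convention, and the T * R * S order are all assumptions made for the example.

```cpp
#include <array>
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };
struct Quat { float x, y, z, w; };  // assumed to be unit length

// Row-major 4x4 matrix, column-vector convention:
// the translation ends up in m[3], m[7], m[11].
using Mat4 = std::array<float, 16>;

// Builds M = T * R * S from the three sampled animation curves of a joint.
inline Mat4 compose_joint_matrix(const Vec3& t, const Quat& q, const Vec3& s) {
    // The rotation formula below is only valid for a normalized quaternion.
    assert(std::fabs(q.x * q.x + q.y * q.y + q.z * q.z + q.w * q.w - 1.0f) < 1e-3f);

    const float xx = q.x * q.x, yy = q.y * q.y, zz = q.z * q.z;
    const float xy = q.x * q.y, xz = q.x * q.z, yz = q.y * q.z;
    const float wx = q.w * q.x, wy = q.w * q.y, wz = q.w * q.z;

    // Each rotation column is multiplied by the matching scale factor.
    Mat4 m = {{
        (1.0f - 2.0f * (yy + zz)) * s.x, 2.0f * (xy - wz) * s.y,          2.0f * (xz + wy) * s.z,          t.x,
        2.0f * (xy + wz) * s.x,          (1.0f - 2.0f * (xx + zz)) * s.y, 2.0f * (yz - wx) * s.z,          t.y,
        2.0f * (xz - wy) * s.x,          2.0f * (yz + wx) * s.y,          (1.0f - 2.0f * (xx + yy)) * s.z, t.z,
        0.0f,                            0.0f,                            0.0f,                            1.0f
    }};
    return m;
}
```

With 54 joints per dragon, a function like this would run 54 times per dragon per frame, which is why the CPU cost scales with the number of dragons rather than with the triangle count.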

Floh said...

PS: actually I just checked the math lib, and the quaternion slerp is already using a fast path when the quaternions are close by. It's just the Windows version of the code, which uses XNAMath, that always uses the expensive slerp.
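For readers unfamiliar with the trick, a slerp with such a fast path can be sketched in C++ roughly like this. It is not the Nebula3 math lib: the names and the 0.9995 threshold are assumptions for the example. When the two quaternions are nearly parallel, it falls back to a normalized linear interpolation and skips the trigonometric calls.

```cpp
#include <cassert>
#include <cmath>

struct Quat { float x, y, z, w; };

inline float dot(const Quat& a, const Quat& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

// Weighted sum wa*a + wb*b, component by component.
inline Quat blend(const Quat& a, float wa, const Quat& b, float wb) {
    return Quat{wa * a.x + wb * b.x, wa * a.y + wb * b.y,
                wa * a.z + wb * b.z, wa * a.w + wb * b.w};
}

// Assumed cutoff: above this cosine the arc is so short that lerp and
// slerp are nearly indistinguishable.
const float kLerpThreshold = 0.9995f;

// Interpolates between unit quaternions a and b for t in [0, 1].
inline Quat slerp_fast(const Quat& a, Quat b, float t) {
    assert(t >= 0.0f && t <= 1.0f);
    float d = dot(a, b);
    if (d < 0.0f) {
        // q and -q are the same rotation; flip b to take the shorter arc.
        b = Quat{-b.x, -b.y, -b.z, -b.w};
        d = -d;
    }
    if (d > kLerpThreshold) {
        // Fast path: linear interpolation followed by renormalization.
        Quat r = blend(a, 1.0f - t, b, t);
        const float inv_len = 1.0f / std::sqrt(dot(r, r));
        return Quat{r.x * inv_len, r.y * inv_len, r.z * inv_len, r.w * inv_len};
    }
    // Slow path: true spherical interpolation.
    const float theta = std::acos(d);
    const float inv_sin = 1.0f / std::sin(theta);
    return blend(a, std::sin((1.0f - t) * theta) * inv_sin,
                 b, std::sin(t * theta) * inv_sin);
}
```

Consecutive animation keys are usually close together, so in a setup like the one described most calls would take the fast path.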

Fungos Bauux said...

Thank you for the reply, it was very useful.

Also, another quick question, how many bones per vertex do you use/permit on N3?

Floh said...

@Fungos: apologies for the late answer: up to 4 bones per vertex.