5 Mar 2008

Nebula3's Multithreaded Rendering Architecture

Alright! The Application Layer is now running through the new multithreaded rendering pipeline.

Here's how it works:

  • The former Graphics subsystem has been renamed to InternalGraphics and now runs in its own "fat thread" together with all the lower-level Nebula3 subsystems required for rendering.
  • There's a new Graphics subsystem running in the application thread with a set of proxy classes which mimic the InternalGraphics subsystem classes.
  • The main thread is now missing any rendering related subsystems, so trying to call e.g. RenderDevice::Instance() will result in a runtime error.
  • Extra care has been taken to make the overall design as simple and "fool-proof" as possible.
  • There's very little communication necessary between the main and render threads, usually just one SetTransform message for each graphics entity which has changed its position.
  • Communication is done with standard Nebula3 messages through a single message queue in the new GraphicsInterface singleton. This is an "interface singleton" which is visible from all threads. The render thread receives messages from the main thread (or other threads) and never actively sends messages to other threads (with one notable exception on the Windows platform: mouse and keyboard input).
  • Client-side code doesn't have to deal with creating and sending messages, because it talks to the render thread through proxy objects. Proxy objects provide a typical C++ interface and, since there's a 1:1 relationship between a proxy and its render-thread counterpart, may cache data on the client side to prevent a round trip into the render thread (so there's some data duplication, but a lot less locking).
  • The Graphics subsystem offers the following public proxy classes at the moment:
    • Graphics::Display: setup and query display properties
    • Graphics::GraphicsServer: creates and manages Stages and Views
    • Graphics::Stage: a container for graphics entities
    • Graphics::View: renders a "view" into a Stage into a RenderTarget
    • Graphics::CameraEntity: defines a view volume
    • Graphics::ModelEntity: a typical graphics object
    • Graphics::GlobalLightEntity: a global, directional light source
    • Graphics::SpotLightEntity: a local spot light
  • These proxy classes are just pretty interfaces and don't do much more than create and send messages into the GraphicsInterface singleton.
  • There are typically 3 types of messages sent into the render thread:
    1. Synchronous messages which block the caller thread until they are processed. These exist just for convenience, and only for methods which are usually not called while the main game loop is running (like Display::GetAvailableDisplayModes()).
    2. Asynchronous messages which return immediately but pass a return value back at some later time. These are non-blocking, but the result will only be available in the next graphics frame. The proxy classes do everything possible to hide this fact, either by caching values on the client side so that no communication is necessary at all, or by returning the previous value until the graphics thread gets around to processing the message.
    3. The best and simplest messages are those which don't require a return value. They are just sent off by the client-side proxy and processed at some later time by the render thread. Fortunately, most messages sent during a frame are of this nature (e.g. updating entity transforms).
  • Creation of Graphics entities is an asynchronous operation, but it is possible to manipulate the client-side proxy object immediately after creation even though the server-side entity doesn't exist yet. The proxy classes take care of all these details internally.
  • There is a single synchronization event per game frame where the game thread waits for the graphics thread. This event is signalled by the graphics thread after it has processed pending messages for the current frame and before culling and rendering. This is necessary to prevent the game thread from running faster than the render thread and thus spamming its message queue. The game thread may run at a lower - but never at a higher - frame rate than the render thread.
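
To make the three message types above more concrete, here is a minimal standard C++ sketch. It is not the Nebula3 implementation: MessageQueue, Send, SendAsync, SendWait and RunMessageDemo are hypothetical names, and a std::future stands in for the "result arrives later" behaviour that the real proxy classes hide behind cached values.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical stand-in for the single message queue inside GraphicsInterface.
class MessageQueue {
public:
    // Type 3: fire-and-forget. The sender never waits and gets nothing back.
    void Send(std::function<void()> msg) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(msg));
        }
        cond_.notify_one();
    }

    // Type 2: asynchronous. Returns at once; the result shows up later.
    std::future<int> SendAsync(std::function<int()> query) {
        auto task = std::make_shared<std::packaged_task<int()>>(std::move(query));
        std::future<int> result = task->get_future();
        Send([task] { (*task)(); });
        return result;
    }

    // Type 1: synchronous. Blocks the sender until the message is processed.
    int SendWait(std::function<int()> query) {
        return SendAsync(std::move(query)).get();
    }

    // Ask the consumer loop to exit (an empty message is the stop signal).
    void Stop() { Send(nullptr); }

    // Consumer loop, run by the render thread. Messages are handled in FIFO order.
    void Run() {
        for (;;) {
            std::function<void()> msg;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cond_.wait(lock, [this] { return !queue_.empty(); });
                msg = std::move(queue_.front());
                queue_.pop();
            }
            if (!msg) return;
            msg();
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable cond_;
    std::queue<std::function<void()>> queue_;
};

// Sends two fire-and-forget updates, one async query and one blocking query.
int RunMessageDemo() {
    MessageQueue queue;
    int transformUpdates = 0;  // only ever touched by the render thread
    std::thread renderThread([&queue] { queue.Run(); });

    queue.Send([&] { ++transformUpdates; });                  // type 3
    queue.Send([&] { ++transformUpdates; });                  // type 3
    std::future<int> later =
        queue.SendAsync([&] { return transformUpdates; });    // type 2
    int modes = queue.SendWait([] { return 3; });             // type 1
    assert(modes == 3);

    int seen = later.get();  // both updates were queued before the query
    queue.Stop();
    renderThread.join();
    return seen * 10 + modes;
}
```

Because the queue is strictly first-in first-out, the asynchronous query sees both earlier updates, which is the same ordering guarantee a proxy relies on when it sends a SetTransform followed by a query.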

Here's some example code from the testviewer application. It actually looks simpler than before since all the setup code has become much tighter:
using namespace Graphics;
using namespace Resources;
using namespace Util;

// setup the render thread

Ptr<GraphicsInterface> graphicsInterface = GraphicsInterface::Create();

// setup and open the display
Ptr<Display> display = Display::Create();
// ... optionally change display settings here...

That's all that is necessary to open a default display and get the render thread up and running. The render thread will now happily run its own render loop.

To actually have something rendered we need at least a Stage, a View, a camera, at least one light and a model:

// create a GraphicsServer, Stage and a default View
Ptr<GraphicsServer> graphicsServer = GraphicsServer::Create();

Attr::AttributeContainer dummyStageBuilderAttrs;
Ptr<Stage> stage = graphicsServer->CreateStage(StringAtom("DefaultStage"),


Ptr<View> view = this->graphicsServer->CreateView(InternalGraphics::InternalView::RTTI,

// create a camera and make it the active camera for our view
Ptr<CameraEntity> camera = CameraEntity::Create();
camera->SetTransform(matrix44::translation(0.0f, 0.0f, 10.0f));

// create a global light source
Ptr<GlobalLightEntity> light = GlobalLightEntity::Create();

// finally create a visible model
Ptr<ModelEntity> model = ModelEntity::Create();

That's the code to set up a simple graphics world in the asynchronous rendering case. There are a few issues I still want to fix (like the InternalGraphics::InternalView::RTTI thing). The only thing that's left is to add a call to GraphicsInterface::WaitForFrameEvent() somewhere in the game loop before updating the game objects for the next frame. The classes App::RenderApplication and App::ViewerApplication in the Render layer will actually take care of most of this stuff.
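
The per-frame wait can be pictured roughly like this. The sketch below uses plain standard C++ rather than Nebula3 classes: FrameEvent, FrameCounts and RunFrameSyncDemo are hypothetical names, and FrameEvent::Wait() only plays the role that GraphicsInterface::WaitForFrameEvent() plays in the real game loop. The event is assumed to be auto-resetting, so several signals between two waits collapse into one.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical auto-reset event standing in for the per-frame sync point.
class FrameEvent {
public:
    // Called by the render thread once per render frame.
    void Signal() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            signaled_ = true;
        }
        cond_.notify_one();
    }

    // Called by the game thread; consumes exactly one signal.
    void Wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        cond_.wait(lock, [this] { return signaled_; });
        signaled_ = false;
    }

private:
    std::mutex mutex_;
    std::condition_variable cond_;
    bool signaled_ = false;
};

struct FrameCounts {
    int gameFrames;
    int renderFrames;
};

FrameCounts RunFrameSyncDemo(int gameFramesToRun) {
    FrameEvent frameEvent;
    std::atomic<bool> quit{false};
    std::atomic<int> pendingMessages{0};
    int renderFrames = 0;  // written by the render thread, read after join

    std::thread renderThread([&] {
        while (!quit.load()) {
            pendingMessages.exchange(0);  // 1. process pending messages
            ++renderFrames;
            frameEvent.Signal();          // 2. release the game thread
            std::this_thread::yield();    // 3. culling and rendering go here
        }
    });

    int gameFrames = 0;
    for (; gameFrames < gameFramesToRun; ++gameFrames) {
        frameEvent.Wait();                // once per game frame
        pendingMessages.fetch_add(1);     // e.g. one SetTransform message
    }

    quit.store(true);
    renderThread.join();
    // Every Wait() consumed a distinct Signal(), so the game thread
    // can never have completed more frames than the render thread.
    assert(renderFrames >= gameFrames);
    return {gameFrames, renderFrames};
}
```

The render thread is free to run ahead (its extra signals are simply dropped), while the game thread is throttled to at most one frame per render frame.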

There's some brain-adaption required to work in an asynchronous rendering environment:

  • there's always a delay of up to one graphics frame until a manipulation actually shows up on screen
  • it's hard (and inefficient) to get data back from the render thread
  • it's impossible for client-threads to read, modify and write-back data within one render-frame
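
The one-frame delay and the cost of reading data back are easiest to see in a toy model of a proxy. In the sketch below everything is hypothetical (InternalEntity, EntityProxy, the "visible" flag, ProcessFrame), and the render side is modelled as an explicit per-frame step on the same thread so that the behaviour is deterministic. A real proxy would hand its messages to another thread.

```cpp
#include <cassert>
#include <functional>
#include <vector>

using MessageList = std::vector<std::function<void()>>;

// Hypothetical server-side object that would live in the render thread.
struct InternalEntity {
    float x = 0.0f;
    bool visible = false;
};

// Hypothetical client-side proxy with cached state.
class EntityProxy {
public:
    EntityProxy(MessageList& queue, InternalEntity& server)
        : queue_(queue), server_(server) {}

    // No return value needed: cache locally and queue the update.
    void SetX(float x) {
        cachedX_ = x;
        queue_.push_back([this, x] { server_.x = x; });
    }

    // Answered from the cache, no round trip into the render side.
    float GetX() const { return cachedX_; }

    // Asynchronous query: request a fresh value, return the previous one.
    bool IsVisible() {
        queue_.push_back([this] {
            server_.visible = server_.x > 0.0f;  // made-up visibility rule
            lastVisible_ = server_.visible;      // "reply" to the proxy
        });
        return lastVisible_;
    }

private:
    MessageList& queue_;
    InternalEntity& server_;
    float cachedX_ = 0.0f;
    bool lastVisible_ = false;
};

// Stand-in for one render frame: handle all pending messages in order.
void ProcessFrame(MessageList& queue) {
    for (auto& msg : queue) msg();
    queue.clear();
}

int RunProxyDemo() {
    MessageList queue;
    InternalEntity server;
    EntityProxy proxy(queue, server);

    proxy.SetX(5.0f);
    assert(proxy.GetX() == 5.0f);  // visible to the client immediately
    assert(server.x == 0.0f);      // the render side has not seen it yet

    bool before = proxy.IsVisible();  // still the previous (default) value
    ProcessFrame(queue);              // one graphics frame passes
    bool after = proxy.IsVisible();   // now reflects last frame's reply

    return (before ? 1 : 0) + (after ? 2 : 0);
}
```

The setter costs nothing but a queued message, while the query only becomes accurate one frame later, which is why read-modify-write within a single render frame is not possible from the client side.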

For the tricky application specific stuff I'm planning to implement some sort of installable client-handlers. Client threads can install their own custom handler objects which would run completely in the render-thread context. This is IMHO the only sensible way to implement application specific graphics functionality which requires exact synchronization with the render-loop.
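
Since these handlers are only planned, the following is purely a guess at what such a mechanism could look like, not a description of any existing Nebula3 class. FrameHandler, RenderLoop, RenderState and HalveExposure are all hypothetical names. The idea shown is that a client thread attaches an object once, and the render loop then calls it every frame from inside the render thread, where it can read, modify and write render-side state without any messaging.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical piece of render-side state a handler might want to touch.
struct RenderState {
    float exposure = 1.0f;
};

// Hypothetical handler interface, called from inside the render thread.
class FrameHandler {
public:
    virtual ~FrameHandler() = default;
    virtual void OnFrame(RenderState& state) = 0;
};

class RenderLoop {
public:
    // May be called from any thread; the handler becomes active
    // at the start of the next render frame.
    void AttachHandler(std::shared_ptr<FrameHandler> handler) {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.push_back(std::move(handler));
    }

    // Runs in the render thread.
    void RunFrames(int count) {
        for (int i = 0; i < count; ++i) {
            {
                std::lock_guard<std::mutex> lock(mutex_);
                for (auto& h : pending_) handlers_.push_back(std::move(h));
                pending_.clear();
            }
            // Handlers run in render-thread context, in step with the loop.
            for (auto& h : handlers_) h->OnFrame(state_);
            // ... culling and rendering would follow here ...
        }
    }

    const RenderState& State() const { return state_; }

private:
    std::mutex mutex_;
    std::vector<std::shared_ptr<FrameHandler>> pending_;
    std::vector<std::shared_ptr<FrameHandler>> handlers_;
    RenderState state_;
};

// Example handler: a read-modify-write on render state every frame.
struct HalveExposure : FrameHandler {
    void OnFrame(RenderState& state) override { state.exposure *= 0.5f; }
};

float RunHandlerDemo() {
    RenderLoop loop;
    loop.AttachHandler(std::make_shared<HalveExposure>());
    std::thread renderThread([&loop] { loop.RunFrames(3); });
    renderThread.join();
    return loop.State().exposure;  // read only after the thread has finished
}
```

Only the attach step needs a lock; the per-frame work itself runs unsynchronized because it never leaves the render thread.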

I've had to make a few other changes to the existing code base for the asynchronous rendering to work. Mouse and keyboard events under Windows are produced by the application window (which is owned by the render thread), but the input subsystem lives in the game thread. Thus there needs to be a way for the render thread to communicate those input events to the main thread. I decided to derive a ThreadSafeDisplayEventHandler class (and ThreadSafeRenderEventHandler for the sake of completeness). Client threads can install those event handlers to be notified about display and render events coming out of the render thread.
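
The basic idea behind such a thread-safe handler can be sketched with a mutex-protected event list: the render thread appends events, and the game thread drains them whenever it updates its input subsystem. The names below (InputEvent, ThreadSafeEventQueue, RunInputDemo) are made up for the sketch and say nothing about how ThreadSafeDisplayEventHandler is actually written.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical input event produced by the window in the render thread.
struct InputEvent {
    enum Type { KeyDown, KeyUp, MouseMove } type;
    int code;
};

// Hypothetical hand-over point between render thread and game thread.
class ThreadSafeEventQueue {
public:
    // Called from the render thread when the window reports an event.
    void Put(const InputEvent& event) {
        std::lock_guard<std::mutex> lock(mutex_);
        events_.push_back(event);
    }

    // Called from the game thread; returns and clears all pending events.
    std::vector<InputEvent> TakeAll() {
        std::lock_guard<std::mutex> lock(mutex_);
        std::vector<InputEvent> out;
        out.swap(events_);
        return out;
    }

private:
    std::mutex mutex_;
    std::vector<InputEvent> events_;
};

int RunInputDemo() {
    ThreadSafeEventQueue handler;

    // Plays the role of the render thread owning the application window.
    std::thread renderThread([&handler] {
        handler.Put({InputEvent::KeyDown, 65});
        handler.Put({InputEvent::KeyUp, 65});
        handler.Put({InputEvent::MouseMove, 0});
    });

    // Plays the role of the game thread polling once per game frame.
    int received = 0;
    while (received < 3) {
        received += static_cast<int>(handler.TakeAll().size());
        std::this_thread::yield();
    }

    renderThread.join();
    assert(handler.TakeAll().empty());  // nothing left behind
    return received;
}
```

Swapping the whole list out under the lock keeps the critical section short, so the render thread is never blocked for longer than a vector swap.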

The second, bigger change affected the Http subsystem. Previously, HttpRequestHandlers had to live in the same thread as the HttpServer, which isn't very useful anymore now that important functionality has been moved out of the main thread. So I basically moved the whole Http subsystem into its own thread as well, and HttpRequestHandlers may now be attached from any thread. A nice side effect is that an Http request now only stalls the thread of the HttpRequestHandler which processes the request.

There's still more work to do:

  • need to write some stress-tests to uncover any thread-synchronization bugs
  • need to do performance investigations and profiling (are there any unintended synchronization issues?)
  • thread-specific low-level optimization in the Memory subsystem as detailed in one of my previous posts
  • optimize the messaging system as much as possible (especially creation and dispatching)
  • I also want to implement some sort of method to run the rendering in the main thread, partly for debugging, partly for platforms with simple single-core CPUs

Phew, that's all for today :)


Drazen said...

Is there any release date for nebula3 beta or some kind version with all features ?

Mike said...

Yeah a beta release with all features would be great. The new Architecture and the new C++'ish design of the nebula engine is making me "wuschig" <- dunno that word in english :D

Floh said...

We're trying to do at least bi-monthly SDK releases, these have all the current features, except the platform-specific code for the game console platforms of course, as this would be illegal.

eRiX said...

wow! Thats some mouths full... :D
Really looks kinda XNA-ish.

in N2 capturing sequences always produced some nice real-world-like environment. I used that to look at effects in non >50fps circumstances. Hope the new framework will make this bug go home tho ;]
I mean.. for performance testing...

Kim Hyoun Woo said...

How about using multi-core for any subsystems of graphics system? For instance, such as using seperated core for rendering shadow or particles of the given scene.
I use quad-core system on my work. That might be good for a situation like that one for game loop, the other for rendering and another for physics and so on. But the number of core will increase by the time pass and seperating and assigning several cores for renderings subsystem might make sense on some cases. It means that using a core per graphics subsystem.

Floh said...

@kim: I'm planning to explore finer-grained multithreading within the rendering thread in the future. One thing to keep in mind is that the graphics hardware is still a single resource. But things that happen before feeding the graphics chip, like view volume culling, particle updates and similar stuff, may benefit from additional hardware threads. Other things that will very likely go into their own threads, unrelated to rendering: Physics, Audio, Pathfinding. Before that I'd like to optimize the sh*t out of the current multithreaded rendering code though :)

Kim Hyoun Woo said...

Good to hear the reply. BTW isn't SLI card spreading over? Though it may need time to be common. :)