19 Aug 2012

OpenGL Screw-Ups

A list of embarrassing screw-ups I tend to waste time on (repeatedly) while writing OpenGL-based rendering code. The list will surely grow as I continue working on Twiggy :P
This is generally stuff which doesn't show up with glGetError(), of course.

Sampling a texture just returns black pixels in the fragment shader:

Most likely the texture is incomplete, either because the texture sampler states GL_TEXTURE_WRAP_S, GL_TEXTURE_WRAP_T, GL_TEXTURE_MIN_FILTER and GL_TEXTURE_MAG_FILTER haven't been set up yet, or because the MIN_FILTER state expects mipmaps but the texture has none.
Also, on some platforms / drivers it is more efficient to set up those states before calling glTexImage2D().
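
For reference, a minimal sketch of a texture setup that avoids both traps (width, height and pixels are placeholders; GL_LINEAR as MIN_FILTER sidesteps the mipmap requirement of the default GL_NEAREST_MIPMAP_LINEAR):

    GLuint tex;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    // set all four sampler states before uploading the image data
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, pixels);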

Clearing a render target doesn't work as expected (e.g. the color buffer is cleared, but the depth buffer isn't):

glClear() is affected by the state of glColorMask(), glDepthMask(), and glStencilMask(). If glDepthMask() is set to GL_FALSE, the depth buffer won't be cleared even though glClear() is called with the GL_DEPTH_BUFFER_BIT set.
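
A minimal sketch of a clear that opens all the write masks first (assuming the render target has depth and stencil buffers attached):

    // glClear() only writes where the masks allow it
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    glDepthMask(GL_TRUE);
    glStencilMask(0xFF);
    glClearColor(0.0f, 0.0f, 0.0f, 1.0f);
    glClearDepthf(1.0f);    // glClearDepth(1.0) on desktop GL
    glClearStencil(0);
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT);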

Also: http://www.opengl.org/wiki/Common_Mistakes

17 Aug 2012

Twiggy's Low Level Render Pipeline

Twiggy's render thread is much simpler than Nebula3's current "fat thread". Instead of running a completely autonomous graphics world in the render thread, it simply accepts rendering commands from a push buffer fed by the main thread (and maybe other threads later on). Eventually the render thread will also be extensible through some sort of plugin-modules which may implement new low-level rendering commands.

Extra care has been taken to keep data structures and their relations simple, memory granularity low (arrays-of-structs instead of arrays-of-pointers) and to keep related data close to each other in memory (to increase CPU cache efficiency).

The push buffer is at least double-buffered (more buffers are possible, but probably don't make much sense, since they would only increase latency when the main thread runs too far ahead of the render thread). The render thread will starve if the main thread stops feeding commands; that's something which needs to be fixed for special cases like loading screens, probably by adding flow-control commands to the push buffer so that the render thread can loop over a block of rendering commands.
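
Not actual Twiggy code, but a rough sketch of the double-buffering idea (synchronization between the two threads is omitted for brevity):

    #include <vector>
    #include <cstdint>
    #include <cstddef>

    // Two raw command buffers: the main thread encodes commands into
    // one while the render thread drains the other; Swap() flips the
    // roles once per frame.
    struct PushBuffer {
        std::vector<uint8_t> buf[2];
        int writeIndex = 0;

        // main thread: append an encoded command
        void Write(const uint8_t* data, size_t numBytes) {
            buf[writeIndex].insert(buf[writeIndex].end(), data, data + numBytes);
        }
        // frame boundary (real code needs to sync with the render thread here)
        void Swap() {
            writeIndex = 1 - writeIndex;
            buf[writeIndex].clear();
        }
        // render thread: the commands recorded during the previous frame
        const std::vector<uint8_t>& ReadBuffer() const {
            return buf[1 - writeIndex];
        }
    };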

When doing an experimental port of N3's current render pipeline to Native Client I implemented such a push-buffer-driven render pipeline because of NaCl's "call-on-main-thread" limitation. Each OpenGL call would be serialized into a command buffer and pulled by the "Pepper Thread", where the actual GL calls would be executed. This worked, but it was a terrible hack and convinced me that it isn't a good idea to feed the render thread with such low-level commands (way too many commands per frame, and the feeder thread had to wait for the render thread each time a getter-function or resource-creation-function was called).

Thus the command protocol of Twiggy's render thread is higher-level, but not so high-level that it's completely hard-wired to a specific type of rendering technique.

Something that has worked very well since Nebula2 is the frame-shader system (in N2 these were called RenderPaths). Frame-shaders describe how a complete frame is rendered by dividing the frame into passes and batches. It's a nice and generic medium-level renderer architecture: much less verbose than D3D or OpenGL, but flexible enough to implement various rendering techniques (like forward rendering on low-end platforms versus pre-light-pass rendering on platforms which have a bit more fillrate). It was quite natural to use the frame-shader vocabulary for Twiggy's render command protocol.

A Twiggy render frame is built from the following commands (this excludes display setup and resource creation):

  • 1 BeginFrame
    • [UpdateProjTransform]
    • 1 UpdateViewTransform
    • 1..N BeginPass
      • 1..N BeginBatch
        • 1..N BeginInstances
          • 1..N Draw
          • 1..N DrawInstanced
          • 1 DrawFullscreenQuad
        • EndInstances
      • EndBatch
    • EndPass
  • EndFrame

BeginFrame / EndFrame: this encapsulates an entire render frame; calling EndFrame() signals the render pipeline that all rendering commands for this frame have been issued and that the result can be displayed.

UpdateProjTransform: updates the projection matrix; some rendering techniques may require changing the projection mid-frame (for instance before rendering to a shadow-map).

UpdateViewTransform: updates the view matrix; depending on the actual rendering technique used, this can be called several times per frame.

BeginPass / EndPass: a pass sets (and optionally clears) the active render target, and sets render states which should remain valid for the entire pass.

BeginBatch / EndBatch: a batch only sets a couple of render states which should remain valid between BeginBatch/EndBatch; this is just a way to reduce redundant state switches when rendering instances. Typical batches are solid versus alpha-blended objects, for instance.

BeginInstances / EndInstances: this sets up all necessary data for a series of Draw commands for the same 3D object (geometry, textures, shaders, render states).

Draw / DrawInstanced: several ways to issue actual draw commands; the simplest version takes a world-space matrix and performs one draw call to render one instance. DrawInstanced will be used for rendering several instances with a single command, preferably using hardware-instancing.

DrawFullscreenQuad: Just draws a fullscreen quad, mainly used for post-effects.
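
Put together, a simple frame might be fed to the render thread roughly like this. This is a hypothetical sketch: the renderer object, the pass/batch constants and the resource names are all invented for illustration.

    renderer.BeginFrame();
    renderer.UpdateProjTransform(projMatrix);
    renderer.UpdateViewTransform(viewMatrix);
    renderer.BeginPass(defaultRenderTarget, ClearColorAndDepth);
    renderer.BeginBatch(SolidBatch);
    renderer.BeginInstances(cubeMesh, cubeTexture, cubeShader);
    // one draw call per instance; a single DrawInstanced() call could
    // replace the loop where hardware-instancing is available
    for (const Matrix44& world : cubeWorldTransforms) {
        renderer.Draw(world);
    }
    renderer.EndInstances();
    renderer.EndBatch();
    renderer.EndPass();
    renderer.EndFrame();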

I'll keep resource management out for now; this topic is interesting enough for its own post.

What's currently missing is (1) the whole topic of dynamic resources (instance buffers, dynamic geometry, dynamic textures, ...) and (2) an elegant way to update variable shader parameters.

There will be a way to write dynamic resources from the main thread and use them from the render thread, to prevent excessive data copying over the push buffer (it's a bit silly to move thousands of instance matrices over the push buffer, especially if the matrices are mostly static).
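
Purely speculative at this point, but the basic idea could look like the following sketch (Matrix44, MaxInstances and all member names are placeholders): the main thread fills one half of a double-buffered array in place, and only a handle plus an instance count travel over the push buffer.

    // hypothetical shared dynamic resource, not actual Twiggy code
    struct Matrix44 { float m[16]; };
    enum { MaxInstances = 1024 };

    struct DynInstanceBuffer {
        Matrix44 transforms[2][MaxInstances];   // double-buffered storage
        int writeIndex = 0;

        // the main thread writes instance matrices directly here...
        Matrix44* WritePtr()             { return transforms[writeIndex]; }
        // ...while the render thread reads last frame's half
        const Matrix44* ReadPtr() const  { return transforms[1 - writeIndex]; }
        // flipped at the frame boundary, together with the push buffer
        void Swap()                      { writeIndex = 1 - writeIndex; }
    };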

Next up: more info about Twiggy's resource management!

15 Aug 2012

Twiggy

I'm currently playing around with some ideas for a new rendering pipeline in Nebula3, strictly experimental stuff, based on the experience so far with Drakensang Online. I'll call it Twiggy because it's so damn slim :)

The driving force is that my focus has moved away from consoles towards more "interesting" and especially more open platforms in the past 2 years, namely the web, mobile, and desktop platforms with some sort of built-in app-shop (or good old Steam for that matter).

The train of thought towards the "grand vision" is basically:
  • Selling software in stores on clunky silver disks is sooo last century...
  • Downloading and installing a complete game for half an hour or more before being able to play isn't much better (I tend to forget about half of the demos I start downloading on my 360 until I delete them at some later time to make room for new demos, which I'll never play either...)
  • Regardless of platform (web, mobile, or app-shops on desktop platforms), the experience should be true "click and play": click a button and start playing a few seconds later, fast enough that the player isn't tempted to start another activity in the meantime.
  • This means a small initial download of a few megabytes (5..20 MB or so), and from then on it's all about:
How much new data can we present to the user per second?

Let's be optimistic and assume somewhere between 2 and 12 Mbit/sec for the civilized areas of the world... that's roughly 0.25 to 1.5 MB of fresh data per second.

This means the asset data which is streamed into the game must still be small. It's no longer about how much data fits on a DVD or a Blu-ray; we can store as much data as we want on web servers. It's all about how fast the user eats through the data while playing the game, and how this compares to the available bandwidth. That's why data compression, sharing and reuse are so important today, and why the size of a Blu-ray disc, which Sony hyped so much a few years back, has become irrelevant.

It's a good thing if the game world is built from many small, similar building blocks which can be recombined as much as possible (of course without having an obviously repetitive game world). This is something which works very, very well in Drakensang Online.

With all those thoughts and the experience from the current rendering architecture (what works well and what doesn't), here are some starting points of what Twiggy will be:
  • It will initially be built on OpenGL instead of D3D, since the majority of target platforms has some flavour of OpenGL as rendering API.
  • The feature base-line will be OpenGL ES 2.0.
  • The performance base-line will probably be an iPad 2.
  • BUT: it should scale up with additional features, and especially additional fillrate, on more powerful target platforms (up to desktop GPUs).
  • GPU performance is much more important than CPU performance. Some potential target platforms (like Adobe Alchemy 2) will cross-compile C++ to some VM byte code while providing full access to the GPU's power. It's important that a game runs smoothly even with such a limited "software CPU".
  • Optimized for rendering scenes built from an extremely high number of simple 3D objects (that's where the data size savings mainly come from).
  • Ditch the "fat render thread" in favour of a very slim render thread behind a simple push buffer. The fat-render-thread design in Nebula3 works, but it is complex: it needs to run in lock-step with the main thread, needs to transfer data back to the main thread, and has a lot of redundant code on both the render-thread and the main-thread side. That's a bit too much hassle for the advantages this design provides.
  • More orthogonality throughout the render pipeline: for instance, hardware-instancing should be usable with fewer restrictions (currently only simple, static 3D objects can be hardware-instanced), and skinned characters and "normal" 3D objects should be made more alike so they can share more features (e.g. using the powerful character animation system for everything, not just characters).
The way there is long, since this is a weekend project, and I will happily throw away and rewrite large chunks of code if they don't feel right. The first step will be the new CoreGraphics2 and Resource2 subsystems, which will implement the new slim render thread, and a resource management system which is more powerful than the current one, yet with much less and simpler code. More on that later.