21 Jan 2007

Nebula3 Multithreading Architecture

In Nebula3 there are 2 fundamentally different scenarios where code runs in parallel. The first scenario is what I call "Fat Threads". A Fat Thread runs a complete subsystem (like rendering, audio, AI, physics, resource management) in a thread, and is basically locked to a specific core.

The second type of thread is what I call a "Job". A job is a piece of data, plus the code which processes that data, packed into a C++ object. Job objects are handed to a job scheduler, which tries to keep cores busy by distributing jobs to cores which currently have a low workload.
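
To make the Job concept a bit more concrete, here is a rough sketch of what such a job object might look like (hypothetical code for illustration only, not the actual Nebula3 job interface):

class Job
{
public:
    virtual ~Job() { }

    // called by the job scheduler on whatever core/thread currently has spare capacity
    virtual void Execute() = 0;
};

// an example job: scale an array of floats in place
class ScaleBufferJob : public Job
{
public:
    ScaleBufferJob(float* buf, int num, float s) : buffer(buf), count(num), scale(s) { }

    virtual void Execute()
    {
        for (int i = 0; i < this->count; i++)
        {
            this->buffer[i] *= this->scale;
        }
    }

private:
    float* buffer;
    int count;
    float scale;
};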

Now the challenge is, of course, to design the overall system in a way that keeps all cores evenly busy at all times. It is more likely, however, that bursts of activity will alternate with phases of inactivity during a game frame. So it is very likely that job objects will have to be created well in advance (e.g. one frame ahead), so that they can fill the gaps in the current frame where the various Fat Threads are idling.

This is where I expect a lot of experimentation and fine tuning.

The second challenge is to make a programmer's life as simple as possible. A game application programmer shouldn't have to care all the time about the fact that he's running in a multi-threaded environment. He shouldn't have to fear creating deadlocks or overwriting another thread's data. He shouldn't have to mess around with critical sections, events and semaphores. Also, the overall engine architecture shouldn't be "fragile": most traditional multi-threading code is fragile in the sense that race conditions may occur, or a forgotten critical section may corrupt data.

Multi-threading becomes tricky when data needs to be shared and when communication between threads needs to happen. These are the two critical areas where a solution to the fragile-code-problem must be found.

On the large scale, Nebula3 solves those 2 problems with a concept called "Parallel Nebulas". The idea is that each Fat Thread, which runs a complete subsystem, has its own minimal Nebula runtime, consisting of just the components required for this subsystem. So if a subsystem running in its own thread needs file access, it has its own file server which is completely separate from the file servers in the other Fat Threads. The advantage of this solution is that most of the code in Nebula doesn't even have to be aware that it is running in a multi-threaded environment, since no data is shared at all between Fat Threads. Every minimal Nebula kernel runs in complete isolation from the other Nebula kernels. The disadvantage is of course that some memory is wasted for redundant data, but we're talking a couple of kilobytes, not megabytes.

This data redundancy eliminates the need for fine grained locking, and frees the programmer from having to think about multi-threading safety at each line of code.

But of course, communication between Fat Threads must happen at some point, otherwise the whole concept would be useless. The idea here is to establish one and only one standard system of communication, and to make really sure that the communication system is bullet-proof and fast. This is where the messaging subsystem comes in. Communication with a Fat Thread is only possible by sending a message to it. A message is a simple C++ object which holds some data, along with setter and getter methods. With this standard means of communication, only the actual messaging subsystem code has to be thread-safe (also, access to resources associated with messages, like memory buffers, must be restricted, because they represent shared data).
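
For illustration, such a message class might look roughly like this (a hypothetical sketch: the class name, the payload and the Message base class are made up, the real Nebula3 message classes may be declared differently):

class LoadTextureMessage : public Message
{
public:
    // setter/getter pair for the message payload
    void SetResourceName(const String& name)    { this->resName = name; }
    const String& GetResourceName() const       { return this->resName; }

private:
    String resName;
};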

This solves much of the multi-threading issues in the Fat Thread scenario, but doesn't solve anything for Job objects. Nebula3 will very likely need to put restrictions in place on what a Job object may and may not do. The simplest approach would be to restrict jobs to simple computations on memory buffers. That way, no complex runtime environment must exist for jobs (no file i/o, no access to rendering, and so on). If this isn't enough, a "job runtime environment" must be defined, which would be its own minimal, isolated Nebula runtime, just as in the Fat Threads. Since a job doesn't start its own thread, but is scheduled into an existing thread from a thread pool, this shouldn't be much of a problem in terms of runtime overhead.

So far, only Nebula3's IO subsystem has been implemented in a Fat Thread as a proof-of-concept, and it is working satisfactorily. For traditional synchronous IO work, a Nebula3 application can simply call directly into the thread-local IO subsystem. So for simply listing the contents of a directory, or deleting a file, a simple C++ method call will do. For asynchronous IO work, a well-defined set of messages exists for common IO operations (e.g. ReadStream, WriteStream, CopyFile, DeleteFile, etc.). Doing asynchronous IO is just a few lines of code: create the message object, fill it with data, and send the message to an IOInterface singleton. If necessary, it is possible to either wait or poll for completion of the asynchronous operation.
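
To illustrate, asynchronous IO might look roughly like the following (a sketch: only the ReadStream message and the IOInterface singleton are taken from the description above, the setter and polling method names are assumptions):

// create the message object and fill it with data
Ptr<ReadStream> msg = ReadStream::Create();
msg->SetURI("home:export/data/table.xml");      // assumed setter name
msg->SetStream(MemoryStream::Create());         // assumed setter name

// send the message to the IO Fat Thread
IOInterface::Instance()->Send(msg);

// ...do some other useful work...

// poll (or alternatively wait) for completion
if (msg->Handled())                             // assumed polling method
{
    // the stream attached to the message now contains the file data
}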

The good thing is that the entire IO subsystem doesn't have a single line of multi-threading-aware code, since the various IO subsystems in the different Fat Threads are totally isolated from each other (of course, synchronization must happen at SOME point for IO operations, but that's totally left to the host operating system).

20 Jan 2007

Nebula3 ref-counting and smart pointers.

C++ only offers automatic lifetime management for objects created on the stack. When the C++ context is left, stack objects will be destroyed automatically:

{
    // create a new object on the stack
    MyObject obj;

    // do something with obj...

    // current context is left, obj is destroyed automatically
}

When creating an object on the heap, the object has to be destroyed manually, otherwise a memory leak will result:

{
    // create an object on the heap
    MyObject* objPtr = new MyObject;

    // do something with objPtr...

    // need to manually destroy the object
    delete objPtr;
}

This all gets much more complicated when more than one "client" needs access to a C++ object, because then ownership rules must be defined (the owner is responsible for deleting the object, all other clients just "use" it).

In a complex software system, this ownership management gets tricky very quickly. An elegant solution to this problem is refcounting. With refcounting, no ownership needs to be defined: each "client" increments the reference count of the target object, and decrements it by calling a Release() method when it no longer needs to access the object. When the refcount reaches zero (meaning no client accesses the object any more), the object is destroyed. This fixes the multiple-client scenario, but still requires the programmer to manually call the Release() method at the right time.
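
In code, manual refcounting looks roughly like this (a sketch with assumed method names, and assuming the refcount starts at 1 when the object is created):

// client A creates the object, refcount is 1
MyObject* obj = new MyObject();

// client B also wants to use the object, refcount is now 2
obj->AddRef();

// ...both clients work with the object...

obj->Release();     // client A is done, refcount drops back to 1
obj->Release();     // client B is done, refcount drops to 0, the object destroys itself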

Smart pointers fix this second problem as well. A smart pointer is a simple templated C++ object which points to another C++ object and manages the target's refcount on creation, destruction and assignment. Other than that, a smart pointer can be used just like a raw pointer, except that it fixes all the dangerous stuff associated with raw pointers.
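
To show what happens under the hood, here is a stripped-down sketch of what such a smart pointer class could look like (much simplified compared to a production implementation like Nebula3's Ptr<> class):

template<class TYPE> class Ptr
{
public:
    Ptr() : ptr(0)                              { }
    Ptr(TYPE* p) : ptr(p)                       { if (this->ptr) this->ptr->AddRef(); }
    Ptr(const Ptr<TYPE>& rhs) : ptr(rhs.ptr)    { if (this->ptr) this->ptr->AddRef(); }
    ~Ptr()                                      { if (this->ptr) this->ptr->Release(); }

    void operator=(const Ptr<TYPE>& rhs)
    {
        // addref the new target first, this also handles self-assignment
        if (rhs.ptr) rhs.ptr->AddRef();
        if (this->ptr) this->ptr->Release();
        this->ptr = rhs.ptr;
    }

    TYPE* operator->() const    { return this->ptr; }
    TYPE& operator*() const     { return *(this->ptr); }

private:
    TYPE* ptr;
};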

Let's take a look at how the above code would look in Nebula3:

{
    // create a C++ object on the heap
    Ptr<MyObject> obj = MyObject::Create();

    // do something with obj
    obj->DoSomething();

    // at the end of the context, the smart pointer object is destroyed
    // and will release its target object
}

With smart pointers, a heap object handles exactly like a stack object: no extra care is needed for releasing the object at the right time. Smart pointers also fix the cleanup problem with arrays of pointers. If you want to create a dynamic array with raw pointers to heap objects, you must take care to delete the target objects manually before destroying the array, because a raw pointer has no destructor which could be called when the array is destroyed. By creating an array of smart pointers, this problem is solved as well. When the array is released, it will call the destructors of the contained smart pointers, which in turn will release their target objects:

{
    // create an array of smart pointers
    Array<Ptr<MyObject> > objArray;

    // create objects and append to array
    int i;
    for (i = 0; i < 10; i++)
    {
        Ptr<MyObject> obj = MyObject::Create();
        objArray.Append(obj);
    }
    // when the current context is left, the array is destroyed,
    // which destroys all its contained smart pointers, which in turn
    // destroy their target objects...
}

That's it, simple and elegant. With smart pointers you can work with heap objects just as if they were stack objects!

Refcounting and smart pointers are not perfect however. They fail on cyclic dependencies (when 2 objects point to each other). There seems to be no clean and easy fix to this. Thankfully, in a well-designed software system cyclic dependencies are rarely necessary.
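
For illustration, here is what such a cycle looks like (a sketch, assuming a refcounted Node class with a Ptr<Node> member and the usual Create() method):

{
    Ptr<Node> a = Node::Create();
    Ptr<Node> b = Node::Create();

    a->other = b;       // b's refcount goes up to 2
    b->other = a;       // a's refcount goes up to 2

    // when a and b go out of scope here, both refcounts only drop back
    // to 1, since the objects still point at each other: neither refcount
    // ever reaches zero, and both objects leak
}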

19 Jan 2007

Nebula3 Foundation Layer Overview

The Nebula3 Foundation Layer provides common platform abstraction services below rendering and audio. It's the lowest layer in the Nebula3 layer model and can be used on its own for applications which require an easy-to-use low-level framework but don't need to do 3d rendering.

The Foundation Layer is already pretty much finished, with a complete set of test classes and performance benchmark classes.

The Foundation Layer consists of the following subsystems:
  • Core: this implements the basic Nebula3 object model, with support for ref-counting, smart pointers, RTTI and creating objects by class name or a class FourCC identifier (see the small sketch below this list).
  • Memory: provides memory management and wrapper functions.
  • Util: a collection of utility classes, mainly different types of containers, a powerful string class, a GUID wrapper class and so on...
  • Timing: this subsystem offers classes for measuring time and profiling.
  • IO: An all-new powerful IO subsystem, inspired by the .NET IO framework. IO-Resources are identified by URIs, stream classes provide generic data channels, and stream readers/writers offer specialized access to streams.
  • Threading: provides low-level wrapper classes for multithreading
  • Messaging: This is an improved version of Mangalore's messaging subsystem. Messages are a standardized way to communicate between objects in the same thread, between threads, or between applications on the same machine or across a network.
  • Math: A standalone vector math library. The idea here is to make math code look much like HLSL shader code, and provide the highest performance possible across different platforms and compilers.
  • Net: The low-level networking subsystem. Provides wrapper classes for sockets, IP addresses, a simple generic client/server system, and basic HTTP support (for instance a HTTP stream class, making transparent IO over HTTP possible).
  • Scripting: Nebula3's scripting subsystem is much less obscure than Nebula2's, and easier to extend. The standard scripting language is now LUA (as compared to TCL in Nebula2), reducing the memory footprint drastically. Scripting is no longer tied into the architecture: it is easy to disable scripting altogether and reduce the memory footprint even more. It's still possible to add support for other scripting languages in Nebula3.
  • Attr: Implements the concept of dynamic attributes from Mangalore. Attributes are compile-safe key/value pairs, thus offering compile-time validation of attribute names and data types. Attributes are the foundation for the database subsystem.
  • Db: The database subsystem is an improved version of Mangalore's database subsystem, offering an abstract, performance-optimized interface to a SQL database for storing all types of application data. The standard implementation uses SQLite as its database backend, the wrapper code has been carefully tuned for best database performance.
  • Naming: Implements a hierarchical naming service for Nebula3 objects. Naming is no longer hard-coded into the base class, instead objects can be registered with the naming service at runtime, adding the additional overhead only when needed.
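
To give a small taste of the Core subsystem mentioned above, creating objects by class name might look roughly like this (a hedged sketch; the factory and method names are assumptions based on the description, not verified Nebula3 API):

// create an object by class name through a central factory singleton
Ptr<RefCounted> obj = Factory::Instance()->Create("MyApp::MyObject");

// or create an object directly through the static Create() method of its class
Ptr<MyObject> myObj = MyObject::Create();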

Nebula3 Architecture Overview

  • Nebula3 will be divided into 3 layers, where each layer builds on top of the one below:
    • Foundation Layer: the lowest level layer, this provides a basic platform abstraction layer below rendering and audio. The Foundation Layer can be used as a platform abstraction for any type of application, not just real-time 3d apps.
    • Render Layer: this is the medium layer, which adds more meat to the Foundation Layer, like 3d rendering, audio, physics, and scene management.
    • Application Layer: this is the highest layer and provides a complete game framework which lets the programmer concentrate on the game logic instead of caring about all the little details necessary for being a well-mannered game application.
  • Nebula3 will integrate Mangalore completely, the subsystems from Mangalore will be integrated into different Nebula3 layers where they fit best.
  • Nebula3 will be more "C++-ish" than Nebula2, and doesn't try as hard to use C++ for stuff it wasn't intended for.
  • Nebula3 implements object lifetime management through refcounting and smart pointers
  • Nebula3's new object model uses a base class with 4 bytes of instance data instead of the 70+ bytes in Nebula2.
  • RTTI is faster and easier to use
  • Nebula3 still doesn't use C++ exceptions, C++ RTTI or the STL (all of these degrade performance and/or reduce portability).
  • creating objects by class name is faster and easier to use
  • Nebula3 will be largely clib-clean: no complex clib functions (like file i/o or multithreading) are allowed, removing an additional code layer.
  • Nebula3 uses LUA as its standard scripting language, instead of TCL (however, it's still possible to add support for other scripting languages)

From Nebula2 to Nebula3

Here's why Nebula3 is needed:
  • Nebula2 was mainly a rewrite of Nebula's higher-level areas. The kernel and low-level code were largely carried over unchanged from Nebula1, so some of that low-level code is nearly 8 years old, and it shows.
  • Some Nebula2 features which were "cool" in their day have become irrelevant (at least for Radon Labs). For instance being able to switch between OpenGL and D3D rendering at runtime, the fine-grained scripting support, etc...
  • More real-world experience shows how to better arrange certain subsystems, moving them up or down in the Nebula layer model.
  • Nebula is hard to grasp for beginners, partly caused by its somewhat esoteric object model and other design decisions. Also, experience shows that application programmers work with the high level game framework interfaces (Mangalore), and hardly work with Nebula directly. Thus, Nebula becomes more of a platform abstraction layer for the high level game framework code. Nebula3 will respect this paradigm shift.
  • Nebula2 is hard to scale upwards and downwards (modern multi-core hardware and DirectX10 on the upper end, Nintendo DS on the lower end). Now, it's probably not a good idea to try to write an engine that scales unchanged from a next-gen console down to a Nintendo DS, but it should be possible to at least use a common engine core, which is slim enough for handhelds, while still being a good foundation for a next-gen engine (a small memory footprint and good performance don't hurt on bigger hardware either).
  • Better multithreading infrastructure. Nebula3 is designed with multi-core hardware in mind, and provides a programming model where the programmer doesn't have to care too much about running in a multithreaded environment.
  • Better networking infrastructure. Networking was bolted onto Nebula2 as an afterthought. Nebula3 will offer networking support from the ground up (from direct TCP/UDP communication and direct support for HTTP and FTP on the low end to session management and built-in multiplayer support for Nebula applications on the high end).
  • Nebula2 doesn't provide a proper high-level game framework; that's why we wrote Mangalore. This split approach caused much confusion. Nebula3 will be designed in 3 layers, where the highest layer provides a complete application framework, thus integrating Mangalore back into Nebula.

Current state of the Nebula Device

A few words about the current state of Nebula2:
  • Nebula2 is now considered stable, which means no big changes will be made to the Nebula2 code base, except for maintenance, optimizations and ports.
  • We recently added XACT support to Nebula2.
  • Development focus has shifted to Mangalore, our high-level game framework, and Nebula3, the next big version of the Nebula Device.
  • We created a stripped down Nebula2 version at Radon Labs (dubbed Nebula Embedded), and ported it to the Nintendo DS (internally called Nebula DS). There is currently a team at Radon Labs working on a title for the Nintendo DS.
  • Nebula3 will have a 3-layer architecture, with the core and graphics subsystems rewritten completely, and Mangalore integrated as the high-level application layer.
  • More details on Nebula3 soon on this blog...

16 Jan 2007

Game Development Essentials: Bugtracking (or how we ended up writing our own bug tracker)

The last post of this series dealt with version control, now it's time for bug tracking. In any non-trivial project it's important to have a good plan, or design document, which defines the final product, and a means to keep track of how close the current state of the project is to the final product. Also, in a team, it is important to define the work items each team member should work on at a given time, and to keep track of which work items have been finished, and how this affects the current overall progress versus the planned progress. Finally, during the development of any project, hundreds of unexpected little annoying bugs creep in. These bugs must be collected, analyzed, fixed and squashed.

If this sounds terribly complicated to you, that's because it is. During a game production, many plans and deadlines are made on various levels of abstraction. A simplified list from most abstract to most detailed may look like this:
  • Game Proposal: This is a very abstract document often used while pitching a new game to publishers. A game proposal is somewhere between 4 and 10 pages and gives a quick overview of the game. The point of the proposal is to get a publisher interested and to convince him that the game will make him a shitload of money.
  • Game Design Document: The game design document is usually created after a project has been signed, during the so-called pre-production phase. The design document should be the cook-book for the project. The more details are fleshed out here, the better. The Game Design Document can be several hundred pages thick. Some features however cannot be planned in detail beforehand, especially in game projects, because it may be unclear whether a planned feature "feels right" once it is implemented. What sounds good in theory may totally stink in practice (in fact, very many features fall into this category). So it's important to know what things to leave intentionally open in the design document, and it's important to add "spare time" for experimentation (however, project managers generally don't like this because it messes up their Gantt charts).
  • Milestone Plan: the milestone plan structures the Game Design Document into a list of features, spread over the main production time, and breaks down the feature list into milestones. Advance payments from the publisher are usually bound to successfully delivering those milestones on time. The milestone plan's features are usually down to a granularity level like "10 creature models finished", or "The player character can pick up objects.".
  • Work Items: Here we are at the most detailed planning level. Features from the milestone plan are split further into single work items. A work item is a chunk of work assigned to a single person. For instance, we could take one of the 10 creatures from the milestone definition above and split it into several work items:
    1. Design.
    2. Modelling.
    3. Texturing.
    4. Animation.
    5. Acceptance.
"Now, wait a minute, wasn't this post about bug tracking, not planning?" - you might rightfully ask. Well, the problem is that bug tracking and planning are really 2 sides of the same coin. And that's a terribly hard problem to solve.

There are tools for planning, which all look more or less like MS Project, and there are tools for bugtracking, which are all more or less comfortable database frontends. Now we spent literally years looking for the perfect bugtracker (we had already learned not to expect an integrated planning and bugtracking tool). We looked at everything that's on the market; we literally spent several man-months just evaluating what's there! In the end we did the unthinkable: We wrote our own bugtracker. And it was exactly the right thing to do.

Now, how the hell did it come to this?

(Disclaimer: for the rest of the post, when I write "bug", I really mean "work item").

The short answer is: usability. Try to enter a new bug into a tool like Bugzilla or Mantis. It is a nightmare of web forms, filled with dozens of fields which must all be clicked and filled out. It's confusing, takes up to several minutes per bug, and generally isn't a lot of fun. Now, when we were a small company with one project at a time and a single project manager, we already had a system in place which was a joy to work with and where entering a bug only took somewhere between 10 seconds and half a minute. All bugs were laid out in a single table, every line one bug. Bugs could be filtered in an instant, statistics could be generated, and several users could work in parallel on the bug database. It was almost perfect. The name of that magic bugtracking tool? An Excel sheet with a few macros!

There were several shortcomings that we soon became aware of:
  • It doesn't scale well: once you hit several thousand bugs in a project and more than 10 people, the shared Excel sheet starts to slow down very quickly.
  • No real database backend, which made it impossible to define more complex filters.
  • Several projects had to be handled in parallel. Way too much data for a single sheet.
  • Direct access for external QA teams was impossible, so we had to move bugs back and forth manually between different systems.
So this is where an evaluation odyssey began. We looked at several readily available bugtracking tools and decided to switch to Mantis, while still continuing to look for better alternatives. When we switched to Mantis an interesting effect occurred: projects which used Mantis as their bugtracking tool had dramatically fewer bugs than projects which used our traditional Excel sheet! Obviously, the Mantis projects didn't magically obtain better code quality. People simply entered fewer bugs because entering a bug in Mantis was more complicated than before and took much more time. Unfortunately we weren't really aware of the importance of this fact until late into the project, and we had to start slipping deadlines (something unheard of so far at Radon Labs). This really taught us a lesson about how important a good bugtracking tool is for the existence of the company.

We knew exactly what we wanted from a bugtracking tool, we tried for 2 years to find a better solution than Excel (imagine that!), and finally we decided to write our own tool. We hired a programmer with C# and SQL experience, and within one month we had a first version of a working bugtracker which did everything exactly the way we imagined and fixed all the shortcomings of our Excel sheet!

The requirements for our bug tracker are as follows:
  • entering a new bug and filtering existing bugs must be simple, intuitive and fast
  • must support multiple projects
  • must work over a DSL line
  • must support user roles (access rights, users assigned to projects and departments)
  • should have a true SQL database backend
  • it must be possible to extract certain statistics
  • must support parent-child and follow-up dependencies between bugs
  • it must be possible to add attachments to a bug
We decided that it would be best to implement the front end in C#, since GUIs and database access are what C# does really well. It's also relatively easy to find programmers experienced in C# and SQL.

The default view of the bugtracker basically looks like an Excel sheet. Each line represents one bug. Every column represents a bug attribute:
  • Synopsis: a short summary of the bug (or work item)
  • Category: one of Bug, Plan, Story, Suggestion or Task
  • Priority: from 1 (most important) to 4 (least important)
  • Department: one of prog (Programming), gfx (Graphics), level (Level Design), QA (Quality Assurance) or PM (Project Management)
  • State: the current state of the bug, one of: open, fixed, duplicate, in the works, nice idea or obsolete
  • Creator: who entered the bug?
  • Assigned To: who's the bug assigned to?
  • Date Created: when has the bug been entered?
There are several other fields which are usually hidden, but can be configured to be visible.

Below the table view is a large text entry field for the bug description; this should contain a detailed description of the bug, and at least the steps to reproduce it. The description field will also contain automatically generated log messages when the state of the bug has been changed. The attachment list contains all attached files, which can be inspected by double-clicking on them. It's also possible to save the attachments to the client machine.

At the top of the table view is a row of drop-down boxes which allow quickly filtering the list of displayed bugs (for instance: showing all my open bugs of priority 1 only needs two mouse clicks). More complex filters with boolean operations can be created and saved for re-use very easily as well. There's also a pre-defined standard filter called "My Bugs", which displays all open and work-in-progress bugs assigned to me.

Now, how does bug tracking work in practice?
  • Alice from QA finds a crash bug in project A (hopefully just an assert() that got triggered). After some trial and error she finds out how to reproduce the bug. She goes into the bug tracker (which is usually open all the time) and creates a new bug in project A, which adds a new empty line to the bug table. She fills out the fields right in place; she knows that the bug should go to programming, but isn't sure who will be working on it. That's why she assigns the bug to the lead programmer of project A, who is Bob. Since it's a crash bug she will definitely set the priority to 1. Usually she also nags Bob directly about the bug if she thinks it should be fixed immediately.
  • Bob checks his open bugs and looks at the bug description. From the bug description, looking at the source, and maybe checking previous versions of the source code, he's pretty sure that fixing the bug is programmer Carl's job.
  • Now he re-assigns the bug to Carl and also tells him to have a look at it ASAP.
  • Now Carl checks his open bugs and finds Alice's bug. Looking at the description he's pretty sure what's wrong and fixes the bug. Once he's sure the fix works by trying to reproduce the bug following Alice's repro steps, he commits the changes to version control.
  • He sets the bug to fixed and tells Bob and Alice that the fix should be in the next build.
  • Now, in the "official" Life-Of-A-Bug, the fixed bug would be re-assigned to Alice automatically, and when the new build is available, Alice would have to accept the fix from Bob by checking that the bug is indeed fixed, and if so, set the bug to closed. At Radon Labs we omit this final step, and let the bug's life end at the fixed stage.
During production, thousands of bugs are entered into a project's bug database (remember, these are not just critical programming bugs, but all types of work items for the entire team). Often, non-critical bugs will remain unfixed for some time, duplicate bugs will be entered, or bugs become no longer reproducible for some reason. That's why it is necessary that the bug list is maintained and kept tight. Also, sometimes bug priorities must be decided "by committee": maybe the graphics department thinks some bug is highly critical, while the project manager thinks it isn't. That's why the project manager, the lead programmer and the heads of the graphics and level-design departments gather every one or two weeks to do a bug triage. This is just a short meeting where the list of open bugs is reviewed, and bugs are re-assigned, re-prioritized, or set to obsolete or duplicate. This is necessary housekeeping for keeping the bug list clean and for setting the right priorities in order to hit the next milestone on time.

That's it for a basic overview of bug tracking! However, there's much more to planning and bug tracking than could be written in a single post. So maybe I'll come back to this topic at some later time.