In Nebula3, there are two fundamentally different scenarios where code runs in parallel. The first scenario is what I call "Fat Threads". A Fat Thread runs a complete subsystem (like rendering, audio, AI, physics or resource management) in its own thread and is basically locked to a specific core.
The second type of thread is what I call a "Job". A job is a piece of data, plus the code which processes that data, packed into a C++ object. Job objects are handed to a job scheduler, which tries to keep cores busy by distributing jobs to the cores which currently have a low workload.
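A minimal sketch of what this could look like in code; the names Job and JobScheduler and the scheduling strategy shown here are illustrative assumptions, not the actual Nebula3 interface:

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// a job packs a piece of data and the code that processes it into one object
struct Job
{
    std::function<void()> work;   // captures the data it operates on
    void Run() { work(); }
};

class JobScheduler
{
public:
    explicit JobScheduler(std::size_t numCores) : queues(numCores), load(numCores, 0) { }

    // hand a job object to the scheduler; it picks the core with the lowest workload
    void Enqueue(Job job)
    {
        std::size_t best = 0;
        for (std::size_t i = 1; i < this->load.size(); i++)
        {
            if (this->load[i] < this->load[best]) best = i;
        }
        this->queues[best].push(std::move(job));
        this->load[best]++;
    }

private:
    std::vector<std::queue<Job>> queues;   // one queue per worker thread/core
    std::vector<std::size_t> load;         // outstanding jobs per core
};
```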
Now the challenge is, of course, to design the overall system in a way that keeps all cores evenly busy at all times. It's much more likely that bursts of activity will alternate with phases of inactivity during a game frame. So it is very likely that job objects will have to be created well in advance (e.g. one frame ahead), so that they can fill the gaps in the current frame where the various Fat Threads are idle.
This is where I expect a lot of experimentation and fine tuning.
Now the second challenge is to make a programmer's life as simple as possible. A game application programmer shouldn't have to care all the time that he's running in a multi-threaded environment. He shouldn't have to be afraid of creating deadlocks or overwriting another thread's data. He shouldn't have to mess around with critical sections, events and semaphores. Also, the overall engine architecture shouldn't be "fragile". Most traditional multi-threading code is fragile in the sense that race conditions may occur, or a forgotten critical section may corrupt data.
Multi-threading becomes tricky when data needs to be shared and when communication between threads needs to happen. These are the two critical areas where a solution to the fragile-code problem must be found.
On the large scale, Nebula3 solves those two problems with a concept called "Parallel Nebulas". The idea is that each "Fat Thread", which runs a complete subsystem, has its own minimal Nebula runtime consisting of just the components required for that subsystem. So if a subsystem running in its own thread needs file access, it has its own file server which is completely separate from the file servers in the other Fat Threads. The advantage of this solution is that most of the code in Nebula doesn't even have to be aware that it is running in a multi-threaded environment, since no data is shared at all between Fat Threads. Every minimal Nebula kernel runs in complete isolation from the other Nebula kernels. The disadvantage is, of course, that some memory is wasted on redundant data, but we're talking about a couple of kilobytes, not megabytes.
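A rough sketch of the idea in plain C++ (the class names are made up for illustration, this is not actual Nebula3 code): each Fat Thread constructs its own isolated set of runtime objects, so none of the subsystem code ever touches objects owned by another thread.

```cpp
#include <memory>
#include <thread>

// stand-in for a per-thread subsystem component, e.g. a file server
struct FileServer
{
    // no locking needed anywhere in here: only the owning thread ever calls it
};

// the minimal runtime a Fat Thread builds for itself
struct MiniNebulaRuntime
{
    std::unique_ptr<FileServer> fileServer = std::make_unique<FileServer>();
};

void AudioThreadFunc()
{
    MiniNebulaRuntime runtime;   // lives and dies with the audio thread
    // ... run the audio subsystem against this thread's own runtime ...
}

void RenderThreadFunc()
{
    MiniNebulaRuntime runtime;   // completely separate from the audio one
    // ... run the rendering subsystem ...
}

int main()
{
    std::thread audio(AudioThreadFunc);
    std::thread render(RenderThreadFunc);
    audio.join();
    render.join();
    return 0;
}
```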
This data redundancy eliminates the need for fine-grained locking, and frees the programmer from having to think about multi-threading safety at every line of code.
But of course, communication between Fat Threads must happen at some point, otherwise the whole concept would be useless. The idea here is to establish one and only one standard system of communication, and to make really sure that the communication system is bullet-proof and fast. This is where the messaging subsystem comes in. Communication with a Fat Thread is only possible by sending a message to it. A message is a simple C++ object which holds some data, along with setter and getter methods. With this standard means of communication, only the actual messaging subsystem code has to be thread-safe (also, access to resources associated with messages, like memory buffers, must be restricted, because they represent shared data).
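To make this concrete, here is a small self-contained sketch of the pattern: a message is just a data object with setters and getters, and only the queue that carries messages into a Fat Thread has to be thread-safe. The class names (Message, DeleteFile, MessageQueue) follow the operations mentioned in this post but are illustrative, not the real Nebula3 API.

```cpp
#include <memory>
#include <mutex>
#include <queue>
#include <string>

// a message is a plain C++ object holding data, with setters and getters
class Message
{
public:
    virtual ~Message() = default;
    bool Handled() const    { return this->handled; }
    void SetHandled(bool b) { this->handled = b; }
private:
    bool handled = false;
};

class DeleteFile : public Message
{
public:
    void SetFileName(const std::string& n) { this->fileName = n; }
    const std::string& GetFileName() const { return this->fileName; }
private:
    std::string fileName;
};

// the only piece of code that has to be thread-safe: the queue that
// hands messages over to the Fat Thread
class MessageQueue
{
public:
    void Enqueue(std::shared_ptr<Message> msg)
    {
        std::lock_guard<std::mutex> lock(this->critSect);
        this->queue.push(std::move(msg));
    }
    std::shared_ptr<Message> Dequeue()
    {
        std::lock_guard<std::mutex> lock(this->critSect);
        if (this->queue.empty()) return nullptr;
        std::shared_ptr<Message> msg = this->queue.front();
        this->queue.pop();
        return msg;
    }
private:
    std::mutex critSect;
    std::queue<std::shared_ptr<Message>> queue;
};
```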
This solves most of the multi-threading issues in the Fat Thread scenario, but doesn't solve anything for Job objects. Nebula3 very likely needs to put restrictions in place on what a Job object may and may not do. The simplest approach would be to restrict jobs to simple computations on memory buffers. That way, no complex runtime environment needs to exist for jobs (no file I/O, no access to rendering, and so on). If this isn't enough, a "job runtime environment" must be defined, which would be its own minimal, isolated Nebula runtime, just as in the Fat Threads. Since a job doesn't start its own thread, but is scheduled into an existing thread from a thread pool, this shouldn't be much of a problem in terms of runtime overhead.
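A sketch of what such a restricted job could look like, again with made-up names: the job only touches the memory buffer handed to it at creation time, so it needs no runtime environment of its own.

```cpp
#include <cstddef>

// a job that only performs a simple computation on a memory buffer;
// it deliberately does no file I/O and calls no other subsystem
class ScaleBufferJob
{
public:
    ScaleBufferJob(float* buf, std::size_t num, float s) :
        buffer(buf), numElements(num), scale(s) { }

    void Execute()
    {
        for (std::size_t i = 0; i < this->numElements; i++)
        {
            this->buffer[i] *= this->scale;
        }
    }

private:
    float* buffer;
    std::size_t numElements;
    float scale;
};
```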
So far, only Nebula3's IO subsystem has been implemented in a Fat Thread as a proof of concept, and it is working satisfactorily. For traditional synchronous IO work, a Nebula3 application can simply call directly into the thread-local IO subsystem. So for simply listing the contents of a directory, or deleting a file, a simple C++ method call will do. For asynchronous IO work, a well-defined set of messages exists for the common IO operations (e.g. ReadStream, WriteStream, CopyFile, DeleteFile, etc.). Doing asynchronous IO takes just a few lines of code: create the message object, fill it with data, and send the message to an IOInterface singleton. If necessary, it is possible to either wait or poll for completion of the asynchronous operation.
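In code, the asynchronous case looks roughly like this; I'm sketching it from memory, so the exact class and method names (Ptr, Create(), SetURI(), SetStream(), IoInterface::Instance(), Send(), Handled(), Wait()) and the example path should be treated as assumptions rather than the definitive API:

```cpp
// create the message object and fill it with data
Ptr<IO::ReadStream> msg = IO::ReadStream::Create();
msg->SetURI(IO::URI("home:readme.txt"));          // example path only
msg->SetStream(IO::MemoryStream::Create());

// send it to the IOInterface singleton; the IO Fat Thread does the actual work
IO::IoInterface::Instance()->Send(msg);

// ... do other work for the rest of the frame ...

// either poll for completion ...
if (msg->Handled())
{
    // the stream now contains the loaded data
}

// ... or block until the IO thread has processed the message
IO::IoInterface::Instance()->Wait(msg);
```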
The good thing is that the entire IO subsystem doesn't contain a single line of multi-threading-aware code, since the various IO subsystems in the different Fat Threads are totally isolated from each other (of course, synchronization must happen at SOME point for IO operations, but that's left entirely to the host operating system).