Game Loops for Multicore Chipsets


I have some C# code I’d like to present. It is a console application that runs a game loop and divides that game loop up among multiple threads. The concepts driving this design I’ve shamelessly stolen from Naughty Dog’s fiber-based game engine, Bungie’s job-system engine for Destiny, and Unity’s latest job system (based on my understanding of them, which comes entirely from GDC talks).

Alright, enough introduction - have some code:

In this console application I’m simulating 256 actors, each moving toward a goal position every frame. If actors overlap, they are ‘removed’ from the simulation. The simulation completes when one actor or fewer remains. The simulation is an example game loop, with move and collision phases. What I’d like to discuss is not the simulation itself, but how the simulation’s workload is divided up among multiple threads - and how those threads read and write data using two lists per value (one to read from, one to write to), which is based on the idea of the frame buffer we all know and love from games.

The questions I have are few: am I stupid? Is this an efficient way to divide up work among the CPU cores?

Unlike the engines created by Naughty Dog, Bungie, and Unity, I do not have control over thread-to-core affinity, and thus incur some context-switching overhead. I’m trusting the OS to properly choose which cores are used for the different segments of work being done - and I know that’s not too efficient, but I haven’t found a reliable way to force or ensure core affinity in .NET. There are many articles I would like to link to, but I feel that would complicate what this post is trying to “be about” - questions and a discussion about game loops for multicore chipsets.

Here are a few more examples that show the sim in 1-, 2-, and 3-threaded architectures:

Thanks for taking the time to read my post, and I hope to have an interesting discussion about game loop design for multiple cores.



(1) Here is the single-threaded (Main only) version of the sim:

This is an example of how to write a game loop on a single thread, Main() only. In this example, the actors in the sim are grouped into classes and put on a list in a public static Pool object, then iterated over, each actor getting a move phase and then a collision phase. I consider the move and collision phases to be the ‘work’ being done, and each pass of the while loop over the actor list to be a ‘frame’. This way I can record how many frames it took the simulation to complete, and how long the simulation took to complete. That seems redundant, but when I move the ‘work’ to another thread, like in the next example, the main thread won’t be doing much work, so its frame count will skyrocket while the simulation times drop. For this ‘baseline’ example, the sim times are around 1300ms and the sim always finishes on frame 2040.
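Since the code embed is above, here is a minimal sketch of how I read this baseline - the class names (Pool, Actor), the goal at the origin, and the sign-step movement are my own stand-ins, not the original code:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical reconstruction of the single-threaded baseline:
// all actors live in a public static Pool, and Main() runs
// move + collision phases until one or fewer actors remain.
public class Actor
{
    public float X, Y;
    public bool Active = true;
}

public static class Pool
{
    public static List<Actor> Actors = new List<Actor>();
}

public static class Program
{
    public static void Main()
    {
        var rng = new Random(42);
        for (int i = 0; i < 256; i++)
            Pool.Actors.Add(new Actor { X = rng.Next(-500, 500), Y = rng.Next(-500, 500) });

        int frame = 0;
        var sw = Stopwatch.StartNew();
        while (CountActive() > 1)
        {
            frame++;
            // Move phase: step every active actor one unit toward the goal (0,0).
            foreach (var a in Pool.Actors)
            {
                if (!a.Active) continue;
                a.X -= Math.Sign(a.X);
                a.Y -= Math.Sign(a.Y);
            }
            // Collision phase: deactivate actors that overlap an earlier one.
            for (int i = 0; i < Pool.Actors.Count; i++)
                for (int j = i + 1; j < Pool.Actors.Count; j++)
                {
                    Actor a = Pool.Actors[i], b = Pool.Actors[j];
                    if (a.Active && b.Active && a.X == b.X && a.Y == b.Y)
                        b.Active = false;
                }
        }
        sw.Stop();
        Console.WriteLine($"Finished on frame {frame} in {sw.ElapsedMilliseconds}ms");
    }

    static int CountActive()
    {
        int n = 0;
        foreach (var a in Pool.Actors) if (a.Active) n++;
        return n;
    }
}
```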

(2) Which is exactly what happens here:

All simulation work has been moved out of Main and into a public static method of Program, which Main calls using a thread. Frame counts for completion jump into the millions, while simulation time (on my computer) drops to around 550ms. I’ve only moved the work being done to a different thread than Main, so nothing too special here - but I feel it’s important for people to understand what’s happening step by step. One thing that is important to note is that I’ve ‘locked’ the work method using a boolean. When the work completes, that boolean is unlocked and the work is free to run again. Main runs constantly, checking whether that boolean is in the ready position; if it is, the work is called on a separate thread. This was done to make sure the work happens sequentially, not concurrently.
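A sketch of that boolean handoff as I understand it - SimComplete, MovePhase, and CollisionPhase are hypothetical stand-ins for the real methods, and I’ve marked the flag volatile (an addition of mine) so the cross-thread write is guaranteed to be visible:

```csharp
using System;
using System.Threading;

// Main spins, and whenever the previous frame of work has
// finished, it launches the next one on a new thread.
static volatile bool workReady = true;   // true = worker is idle

static void Main()
{
    long frames = 0;
    while (!SimComplete())
    {
        frames++;                        // Main's frame count now races ahead of the sim
        if (workReady)
        {
            workReady = false;           // lock: one frame of work in flight
            new Thread(Work).Start();
        }
    }
    Console.WriteLine($"main frames: {frames}");
}

static void Work()
{
    MovePhase();                         // hypothetical: the move pass over all actors
    CollisionPhase();                    // hypothetical: the n^2 collision pass
    workReady = true;                    // unlock: Main may start the next frame
}
```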

(3) Now, I’m going to split the work being done in half:

This is where I get a little fuzzy about how the data flows to the CPU cores, and how efficient this design is in reality, because I don’t know how to inspect the flow of data through the chipset to the cores. I can look at my CPU core usage and see more cores being used, and I see the sim times drop and the main frame times drop too - so I’m assuming the OS is scheduling these threads in at least a rational manner. In this example, the actor object is decomposed into lists, one per field of the actor class from the previous examples. The x, y, aiType, and active fields still exist for each actor; they are just split into lists - two lists per value. One list is used to read from, and the other list is used to write to. The move phase has been put into its own method of Program, which locks itself using the boolean strategy from the previous example. This means there are two booleans, each locking a different method, which together describe the move and collision phases of the game loop. Main now checks these two booleans to see if they are ready, starts the move work on one thread and the collision work on another, then waits for both of them to complete. An interesting note here: move does 256 iterations of work, while collision does 256*256 iterations, so collision has a ton of work to do in comparison to move. Because of that difference in the size of work, in the next example I split the collision work in half, with very minimal effort:
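Here is how I’d sketch the read/write double-buffer split described above - the field names, and the helpers StepX, StepY, Overlaps, and SwapBuffers, are my guesses, not the original code:

```csharp
using System.Threading;

// Each actor field becomes two parallel arrays: one read, one write.
static float[] xRead = new float[256], xWrite = new float[256];
static float[] yRead = new float[256], yWrite = new float[256];
static bool[] activeRead = new bool[256], activeWrite = new bool[256];

static volatile bool moveReady = true;
static volatile bool collisionReady = true;

static void Frame()
{
    if (moveReady && collisionReady)
    {
        moveReady = false;
        collisionReady = false;
        ThreadPool.QueueUserWorkItem(_ => MoveWork());
        ThreadPool.QueueUserWorkItem(_ => CollisionWork());
        while (!(moveReady && collisionReady)) { /* spin until both phases finish */ }
        SwapBuffers();   // hypothetical: write lists become next frame's read lists
    }
}

static void MoveWork()                   // 256 iterations
{
    for (int i = 0; i < 256; i++)
        if (activeRead[i])
        {
            xWrite[i] = xRead[i] + StepX(i);   // hypothetical step-toward-goal helpers
            yWrite[i] = yRead[i] + StepY(i);
        }
    moveReady = true;
}

static void CollisionWork()              // 256*256 iterations
{
    for (int i = 0; i < 256; i++)
        for (int j = 0; j < 256; j++)
            if (i != j && Overlaps(i, j))      // hypothetical overlap test on the read lists
                activeWrite[j] = false;
    collisionReady = true;
}
```

Both phases read only from the read lists and write only to the write lists, which is why they can run concurrently without stepping on each other.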

(4) Here the collision work is divided between two threads, bringing the work being done to three threads plus Main, which is a design for 4-core systems:

Here I simply copied the collision work method and divided the iterators so each copy works on one half of the collision range. For (4), the average sim completion time was 259.3ms; for (3), it was 297.9ms. So I see an improvement, which leads me to believe “things are working as expected”.
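How I read the “copied method, divided iterators” split: the outer loop of the collision pass is cut into two index ranges, one per thread. A sketch, reusing the hypothetical Overlaps and activeWrite names from my earlier snippet:

```csharp
static volatile bool collisionAReady = true;
static volatile bool collisionBReady = true;

static void CollisionWorkA()             // first half of the outer loop
{
    CollisionRange(0, 128);
    collisionAReady = true;
}

static void CollisionWorkB()             // second half of the outer loop
{
    CollisionRange(128, 256);
    collisionBReady = true;
}

static void CollisionRange(int start, int end)
{
    for (int i = start; i < end; i++)
        for (int j = 0; j < 256; j++)
            if (i != j && Overlaps(i, j))   // hypothetical overlap test on the read lists
                activeWrite[j] = false;
}
```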

(5) Finally, I ‘jobify’ the work being done so I can easily scale this strategy up for systems with more than 4 cores. This is done by associating index ranges with job instances. For this example, the collision work is still divided between two threads, with move on one thread, just like in (4), but it’s now very easy to add more jobs and threads and divide that work up further using just a few lines of C#:
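My guess at the “indices to job instances” idea: each job owns a slice of the actor range, so adding threads is just adding jobs. Again, Overlaps and activeWrite are hypothetical carry-overs from the earlier sketches:

```csharp
using System.Collections.Generic;
using System.Threading;

class CollisionJob
{
    public int Start, End;               // the slice of the outer loop this job owns
    public volatile bool Done;

    public void Run(object _)            // matches the WaitCallback signature
    {
        for (int i = Start; i < End; i++)
            for (int j = 0; j < 256; j++)
                if (i != j && Overlaps(i, j))
                    activeWrite[j] = false;
        Done = true;
    }
}

static void RunCollisionJobs(int jobCount)
{
    int slice = 256 / jobCount;          // assumes jobCount divides 256 evenly
    var jobs = new List<CollisionJob>();
    for (int k = 0; k < jobCount; k++)
        jobs.Add(new CollisionJob { Start = k * slice, End = (k + 1) * slice });

    foreach (var job in jobs)
        ThreadPool.QueueUserWorkItem(job.Run);

    // Spin until every slice is finished before swapping buffers.
    while (jobs.Exists(j => !j.Done)) { }
}
```

Scaling to more cores is then just a matter of calling RunCollisionJobs with a bigger jobCount.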

Between (4) and (5), sim completion times are similar. So, depending on how .NET and the OS decide to schedule the threads, this design either scales up to systems with more than 4 cores, or it doesn’t - and that’s the part I’m a bit fuzzy on. Anyone care to weigh in?

Here is an improved version of the sim that runs itself 100 times and averages the frame counts and sim times for comparison:
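The harness is simple enough to sketch; ResetSim and RunSimToCompletion are hypothetical names for re-seeding the 256 actors and running one full simulation:

```csharp
using System;
using System.Diagnostics;

const int Runs = 100;
double totalSimMs = 0;
long totalFrames = 0;

for (int run = 0; run < Runs; run++)
{
    ResetSim();                              // hypothetical: re-seed the 256 actors
    var sw = Stopwatch.StartNew();
    long frames = RunSimToCompletion();      // hypothetical: returns frames taken
    sw.Stop();
    totalSimMs += sw.Elapsed.TotalMilliseconds;
    totalFrames += frames;
}

Console.WriteLine($"avg sim: {totalSimMs / Runs:F1}ms, avg frames: {totalFrames / (double)Runs:F1}");
```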

Here is a video where I compare a single-threaded sequential simulation against a 3-threaded concurrent simulation, using the code linked above:

The difference between these two tests is 15 lines of code, which I think shows how flexible this job strategy can be. It can run single-threaded if the chipset has only one core, or multi-threaded if the chipset has many cores. The same 15 lines describe -both- the single- and multi-threaded strategies for doing the simulation’s work.
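The toggle might look something like this - my sketch of the idea, not the exact 15 lines, reusing the hypothetical job methods from the earlier snippets:

```csharp
// One flag picks the strategy; the job methods themselves don't change.
static void DoFrame(bool multiThreaded)
{
    if (multiThreaded)
    {
        QueueJobs();         // hypothetical: queue move + collision jobs on the pool
        WaitForJobs();       // hypothetical: spin on the ready booleans
    }
    else
    {
        MoveWork();          // same job methods, just called inline on Main
        CollisionWorkA();
        CollisionWorkB();
    }
}
```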

In the video, you can see per-core CPU usage go up when I switch from doing simulation work on 1 thread to using 3 threads - creating threads using ThreadPool.QueueUserWorkItem(a => Method());

“You’re not doing anything new; everybody knows multiple threads run faster.”
I know this is true, but I have yet to find any examples showing how to do this with a game loop, as I am trying to do here. I have looked. I have asked people for comparable examples - using global classes with read/write lists to handle single- or multi-threaded architectures easily - and I have not found any. I think these are interesting ideas, based on mutability and global variables / objects / classes.

“But global variables are bad.”
If used improperly, yes, I agree. But I contend they can be used efficiently and usefully, as a way to organize data that many different parts of your program need to talk to. It is simpler. It reduces complexity, in my experience. It is easy to misuse, yes. It’s like a lightsaber: I can accidentally cut off my hand, but I still want it. Public static classes actually solve a lot of problems when you need communication between a lot of different code or different threads. They can group data, or they can group functions. I’m not talking about 100% functional programming here - I’m talking about putting functions that logically fit together (like the drawing functions) into a public static class so they can be called when needed (like Functions.Draw()).

There are many different ways to organize and structure your programs, and the branches of evaluation that flow through them, and I feel like we (as programmers) have chosen to ignore the potential advantages global mutability has in our program designs. In all these examples, I have intentionally used only global program state and mutable design to show that it can be done. It can be done well, and simply, and it’s quite powerful for synchronizing threads, handling data, and doing work. The next game I make will use this as its core, due to the flexibility it provides.

“You can’t make a game like that, with global mutable design. It will be a mess.”
I already have:
I’ve been at it for almost 2 years now, actually.

I’d like to have that talk now, please.
