Chapter 3 Game Engine Architecture

This chapter introduces concepts of software architecture in games in the more general context of game engine architecture. Compared to the other architectures discussed in this course, game engines have a very different kind of system architecture. This difference stems from the very nature of games, which can be described as soft real-time interactive agent-based computer simulations [13] performing a massive amount of numerical computation.

The very essence of a game engine is its main loop, which executes as fast as possible with the aim of not dropping below 30 iterations per second. In general this is not a hard constraint: it cannot be guaranteed and depends on the hardware and the computational complexity of the game, which is why we refer to games as soft real-time. The main loop executes various subsystems, implementing the features required by every game. In the following we briefly discuss the relevant subsystems of modern game engines; an in-depth discussion is beyond the scope of this course, and we refer interested students to the excellent book [13].

  • Rendering is responsible for displaying the graphics on the screen in either 2D or 3D. It is generally the most computationally heavy subsystem and therefore receives dedicated support from rendering hardware such as GPUs. Various graphics APIs such as Direct3D, OpenGL and Vulkan provide a standardised interface for accessing rendering hardware, making it possible to implement platform-independent rendering subsystems.

  • Audio and sound plays in-game background music and sounds, triggered by certain events and influenced by the physics of the environment, e.g. reverberation in a cave or the Doppler effect of moving sound sources. Dedicated hardware exists for this as well, which can be programmed through standardised interfaces such as OpenAL or middleware such as FMOD.

  • Physics performs collision detection and computes realistic movement of rigid bodies, cloth, ropes and other elastic materials. This subsystem can also be quite computationally heavy and has traditionally been computed on the CPU. In the mid 2000s dedicated physics hardware emerged, which eventually got integrated into the more general GPGPUs; in modern game engines GPUs therefore perform physics calculations alongside the CPU. When physical accuracy is paramount, such as in racing games, the physics simulation is executed in a separate thread at, for example, 100 Hz or above, to achieve a higher update rate and not be limited by rendering performance.

  • AI is responsible for implementing various low- and high-level AI functionality such as pathfinding and environment scanning (low-level) or tactical behaviour (high-level). The computations of this subsystem are generally executed on the CPU.

  • Animations are required by every game, whether 2D or 3D. 2D animations are achieved with sprites, which are simply a sequence of images displayed over time. 3D animations are much more computationally heavy and complex, as they are generally implemented using skeletal animation, requiring the interaction of many transformation matrices and the interpolation of many values over time. 3D animation is therefore executed to a great extent on the GPU.

  • Input subsystems capture input from the user from various devices such as mouse, keyboard, joystick or more advanced systems such as Wii controllers. To achieve high performance, buffered input is used: instead of blocking while waiting for input, the OS or the input library buffers input between frames, which can then be queried in the next frame without delay.

  • Data Streaming deals with streaming data from the hard disk / optical disc into the game. As modern games have content that goes far beyond what fits into the main memory of standard hardware, it is necessary to stream data in the background and load/unload parts of a level seamlessly.

  • Networking is an important subsystem for implementing multiplayer functionality. There exist a number of different approaches to programming multiplayer games, and which one to choose depends very much on the genre of the game.

  • Scripting makes it possible to implement game functionality, such as triggering events upon some player action, using a high-level scripting language such as Lua, Python or even JavaScript. This allows for quick prototyping and faster development, as there is no need to customise existing classes and recompile the whole engine; see the sketch after this list.

  • Game Object Management is responsible for providing certain classes of game objects, such as vehicle, NPC, door,… and manages their relationships and interactions. This is essentially the domain layer of traditional monolithic layered architectures; in game engine middlewares, however, these classes have to be quite general to support a broad range of games (within the specific genre).
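
To make the scripting idea concrete, consider the following minimal sketch of how a C++ engine might call into an embedded Lua script when a gameplay event fires. This is our own illustration, not code from any particular engine: the function FireDoorOpenedEvent and the script callback onDoorOpened are hypothetical, and a Lua 5.x state L is assumed to have been set up and to have loaded the game's scripts.

#include <cstdio>
#include <lua.hpp>

// Hypothetical trigger: when a door opens, call the script-defined Lua
// function onDoorOpened(doorId); game logic changes without recompiling.
void FireDoorOpenedEvent(lua_State* L, int doorId)
{
    lua_getglobal(L, "onDoorOpened"); // push the script's callback
    lua_pushinteger(L, doorId);       // push its argument

    // call with 1 argument and 0 results
    if (lua_pcall(L, 1, 0, 0) != LUA_OK)
    {
        // report the script error without crashing the engine
        std::fprintf(stderr, "script error: %s\n", lua_tostring(L, -1));
        lua_pop(L, 1); // remove the error message from the stack
    }
}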

It might come as a surprise, but game engines are built in layers, where upper layers depend on lower layers, but not vice versa, see Figure 3.1.

Figure 3.1: Runtime game engine architecture (Figure taken from [13]).

Practically all modern game engines are written in C++. The reason for this choice of language is that C++ is a modern object-oriented programming language with a rich feature set, allowing low-level memory management and following a policy of zero run-time overhead for unused features. This makes it possible to write extremely performant code while still having the benefits of object-oriented programming. High performance is of fundamental importance in games for obvious reasons. Object-oriented programming is clearly the programming paradigm best suited to game engines in particular and computer simulations in general. C++ evolves over time and there exist various standards such as C++98, C++11 and C++17, with more to follow. Game engine developers pick a standard which suits them and their tools best and then stick to it for at least one generation of their engine, as switching between standards can be a delicate and costly undertaking. Equally, C++ comes with a large number of language features of vastly different complexity, where not all features are equally useful and some are even deemed detrimental and dangerous. Therefore game engine studios deliberately restrict themselves to a specific subset of C++ and avoid the more problematic features, such as exceptions (they cause overhead), certain variants of automatic memory management (games need to be in full control over memory), extensive use of template metaprogramming (it tends to become non-portable and very difficult to understand) and use of the STL and its containers (again, memory management). Mastering efficient C++ requires a deep understanding of computer architecture and is therefore not an easy undertaking. It is clearly beyond the scope of this course to go into the details of C++, and we therefore focus subsequently on architectural concepts rather than implementation details. Where code is provided, it is explained sufficiently to be understandable by a student familiar with a modern object-oriented language such as Java.

Ideally a game engine is platform independent, which is possible by following a component-based architecture: the subsystems are abstracted into interfaces, different implementations are provided for them, and each specific subsystem is programmed against standard APIs. However, a main challenge of a successful game engine is to make the most out of the hardware, which requires a very finely tuned balance and interaction between the subsystems while still being maintainable and platform independent. One of the most fundamental questions in game engine architecture is therefore: how can we arrive at a maintainable software architecture with high cohesion and low coupling and still achieve the required performance? Expressed otherwise: how much performance are we willing to sacrifice for higher software quality, such as high cohesion and low coupling, leading to better maintainability and higher productivity? This question is especially important for companies which produce game engines as a licensing product for other companies which build their own games with it: the engine should be customisable, extensible and applicable to a wide range of games in related genres. For example, the Unreal Engine is generally used in 1st- and 3rd-person 3D games such as Fortnite and Unreal, and it is rather unlikely that we will see its use in real-time strategy games such as StarCraft or in MOBAs such as Dota 2, which normally come with their own engines. See Figure 3.2.

Figure 3.2: Game engine reusability gamut (Figure taken from [13]).

In the remainder of this chapter, with the exception of some aspects of rendering, we will ignore specific implementation details of the various subsystems and primarily focus on the “glue” which holds them together: the game engine architecture. More specifically, we discuss how the main loop is implemented, how subsystems interact and how game object management works, all with a software-engineering mindset aiming at a general, reusable solution.

3.1 History of Game Engines

Looking at the history and development of game engines over time helps a great deal in understanding what game engines are and how they evolved into what they are today. In this history we mainly focus on the PC market, as it was there that the emergence of game engine middlewares was best observed and most pronounced. Console titles long relied on custom-built engines, which only changed in recent years with the 8th console generation (PS4, Xbox One), where the hardware distinction between PC and consoles, already diminishing since the Xbox of the 6th generation, faded further. This made it possible to easily port PC game engines into multiplatform engines that also run on a range of consoles.

We mark the important steps in the history of game engines by technological and hardware developments, as these were always important transitions for game engines, influencing what was possible. It is, however, important to note that this relationship was not unidirectional: game engines also influence and drive technological innovation.

3.1.1 CPU only (until 1996)

In 1993 Doom I hit the market, announcing a new era in gaming history in general and in pop culture in particular. Its impact cannot be overstated, as it triggered an ongoing push on the PC towards higher graphics fidelity, establishing the PC as a major gaming platform. It is generally considered one of the most significant and greatest games in video game history. It pioneered impressive 3D graphics, multiplayer gaming and modding. See Figures 3.3 and 3.4 for screenshots of Doom I + II.

Figure 3.3: Doom I (1993) (Image courtesy of id Software).

Figure 3.4: Doom II (1994) (Image courtesy of id Software).

The game was programmed in C, with some assembly for performance-critical sections. The target hardware was a 486 Intel CPU running at 33 MHz with around 4-8 MByte of RAM. The game was 2.5-dimensional, as it was not possible to look up or down, nor were there any multi-level structures or sloped areas. These features would have required a perspective divide (see [Real-Time Computer Graphics Rendering Pipeline]) for each pixel rendered on screen, which would have been prohibitively expensive in those days, as floating point operations were very costly on these early CPUs.

Doom I was built on id Tech 1, which can be seen as the world’s first widely used middleware game engine, and it was this game from which the term game engine arose. The engine was released under open source.3

Another game engine with a lot of impact was the Build Engine, which was used for Duke Nukem 3D, released in 1996. It had similar limitations to id Tech 1, but achieved higher graphic fidelity and resolution, see Figure 3.5.

Figure 3.5: Duke Nukem 3D (1996) (Image courtesy of 3D Realms).

3.1.2 Graphics cards emerge (from 1996)

In 1996 Quake I (see Figure 3.6) was released. It was the first true 3D game, using innovative rendering technology. Although it still ran on a software rasteriser, with a patch it was able to use the dedicated graphics cards which started to emerge at that time. These graphics cards were very simple, performing fixed-function triangle rendering and texture filtering. Quake also introduced its own scripting language, QuakeC, for more flexibility in development and content generation. Quake II (see Figure 3.7) was released the following year; it still had a software renderer but also out-of-the-box support for hardware-accelerated graphics using OpenGL. The source code of both Quake I4 and Quake II5 was released under open source.

Figure 3.6: Quake I (1996) (Image courtesy of id Software).

Figure 3.7: Quake II (1997) (Image courtesy of id Software).

In 1998 Unreal 1 was released, which set new standards in terms of graphics quality. It is the originator of the Unreal Engine family, one of the most important and influential game engine middlewares of today. It was programmed in C++ and still had a software renderer, but needed an SLI system of two 3dfx Voodoo2 graphics cards to run at maximum quality. Unreal came with a powerful scripting language, UnrealScript, which allowed for quick and flexible modding of the game. See Figure 3.8 for a screenshot of Unreal 1.

Figure 3.8: Unreal (1998) (Image courtesy of Epic).

3.1.3 Towards GPUs (from 1999)

Games from 1999 onwards marked an important development: the departure from software rendering towards the consistent use of graphics cards, which started to evolve into GPUs. The difference between a graphics card and a GPU is rather blurry; the term was first coined by NVidia in 1999 with the advent of the GeForce 256. The main innovation was that geometry transform and lighting operations could now be done on the graphics card, capable of rendering 10 million polygons per second.6 It also came with hardware-accelerated texture combiners, allowing for previously unseen effects.

Milestones were Quake III Arena (see Figure 3.9), using id Tech 3 and still implemented in C, and Unreal Tournament (see Figure 3.10), using the Unreal Engine.

Figure 3.9: Quake III Arena (1999) (Image courtesy of id Software).

Figure 3.10: Unreal Tournament (1999) (Image courtesy of Epic).

These games used their respective game engine middlewares, some of which have seen tremendous success and widespread use. Especially the influence of the Quake game engine family cannot be overstated: elements of Quake engine code can be found in many modern 3D game engines, see Figure 3.11.

Figure 3.11: The family tree of Quake engines.

The Quake III Arena engine id Tech 3 was released under open source.7 Special attention was given to the function for computing the fast inverse square root. It is required to calculate normalised vectors, something of fundamental importance in the renderer’s lighting computations. As millions of these computations need to be executed per second to simulate lighting, it is a fundamentally performance-critical section of the Quake III Arena engine and needs to be very fast. To achieve this, a highly specialised implementation utilising “arcane” numerical methods to approximate the inverse square root is used. The following code snippet is taken directly from the open source release; for an in-depth discussion of its workings we refer to Wikipedia:8

float Q_rsqrt( float number )
{
    long i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y  = number;
    i  = * ( long * ) &y;                       // evil floating point bit level hacking
    i  = 0x5f3759df - ( i >> 1 );               // what the fuck? 
    y  = * ( float * ) &i;
    y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
//  y  = y * ( threehalfs - ( x2 * y * y ) );   // 2nd iteration, this can be removed

    return y;
}
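
To see how such a routine is used, consider normalising a vector: each component is multiplied by the reciprocal of the vector's length. The following sketch is our own illustration (the Vec3 type is assumed), not code from the Quake III release:

struct Vec3 { float x, y, z; };

// Normalise v using the fast inverse square root: since
// |v| = sqrt(x^2 + y^2 + z^2), Q_rsqrt approximates 1/|v|.
Vec3 normalize(Vec3 v)
{
    float invLen = Q_rsqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    return { v.x * invLen, v.y * invLen, v.z * invLen };
}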

Over the following years graphics quality increased further, with milestones being Half Life 2 (see Figure 3.12), using the Source Engine, and Doom 3 (see Figure 3.13), using id Tech 4, where id Software finally made the step towards C++.

Figure 3.12: Half Life 2 (2004) (Image courtesy of Valve).

Figure 3.13: Doom 3 (2004) (Image courtesy of id Software).

The Doom 3 engine id Tech 4 was also released under open source9, as was the updated BFG edition.10

3.1.4 Towards General Purpose GPUs (from 2007)

The following years saw an evolution of GPUs towards more general-purpose (GP) GPUs, becoming mostly fully programmable through shaders (with the exception of a few fixed-function features). The most prominent step towards game engines using GPGPUs was probably Crysis with the CryEngine, the first game to run fully on DirectX 10 (see Figure 3.14); it still looks stunning even by today’s standards.

Figure 3.14: Crysis (2007) (Image courtesy of Crytek).

The Unreal Engine had a prominent representative in this era with Unreal 3, see Figure 3.15.

Figure 3.15: Unreal 3 (2007) (Image courtesy of Epic).

id Software’s engine id Tech 5 powered the game Rage (see Figure 3.16), which marked a turning point in the history of id Software. After the tremendous popularity and success of id Tech 3 (Quake III Arena), it took them too long to come up with a new game (Doom 3), and they were overtaken in popularity by Epic’s Unreal Engine, which remains more popular to this day.

Figure 3.16: Rage (2011) (Image courtesy of id Software).

3.1.5 Towards Multicore (around 2013)

With the advent of multicore CPUs, games also needed to adapt to make the most out of them. Although games had increasingly tried to use multicore CPUs for various tasks like AI, physics and sound processing, it was not until around the early-to-mid 2010s that games began using multiple cores in every aspect. This was partly because only by then had multicore CPUs become widespread enough, with a standard of 4 cores, partly because multicore game engine development is very non-trivial, and partly because graphics driver support for multicore rendering was poor until then.

New engines marked a strong shift towards full utilisation of all cores of a CPU to make the most of the available computing resources. An important aspect of the subsequent sections on modern game engines will be the technical details of multicore game loops. Obviously, the increasing graphics fidelity was enabled not only by the full embrace of multicore architectures but also by increasing GPU computing power and new GPU features.

An engine built from the ground up for multicore architectures is Bungie’s in-house engine used to power Destiny 1 + 2, see Figure 3.17.

Figure 3.17: Destiny 2 (PC, 2017) (Image courtesy of Bungie).

Other modern engines which fully utilise multicore architectures are Unreal Engine 4 (see Figure 3.18) and id Tech 6, as used in Doom (2016) (see Figure 3.19).

Figure 3.18: Unreal Engine 4, Elemental Demo (2015) (Image courtesy of Epic).

Figure 3.19: Doom (2016) (Image courtesy of id Software).

An important focus of engines in recent years has been increased productivity for artists and the reduction of intermediary processing steps when importing assets, for example precomputing visibilities, Level of Detail and lightmaps.

3.1.6 Towards Photorealism (today)

The trend of the last 1-2 years has been towards photorealism, by moving towards extreme geometric detail and incorporating hardware-supported global illumination techniques from physically based rendering, such as ray tracing. This became possible through the ever-increasing computational power of modern GPUs and hardware ray tracing.

The most prominent representative in this category is probably Unreal Engine 5 (see Figure 3.20). id Software also released id Tech 7, powering Doom Eternal (see Figure 3.21), a strong push towards tremendous graphic fidelity and photorealism.

Figure 3.20: Unreal Engine 5 Tech Demo (2020) (Image courtesy of Epic).

Figure 3.21: Doom Eternal (2020) (Image courtesy of id Software).

With the tremendous computing capabilities of modern GPUs and hardware-accelerated ray tracing it became possible to rewrite the renderers of old game engines to use real-time ray tracing, something that was unimaginable a few years ago. Prominent examples are Quake II (see Figure 3.22) and Minecraft (see Figure 3.23).

Figure 3.22: Quake II RTX (2019) (Image courtesy of id Software and NVidia).

Figure 3.23: Minecraft Raytracing (2020) (Image courtesy of Mojang).

3.1.7 The future

The future of game engines can be expected to continue the drive towards photorealism using physically based rendering with ray tracing. Streaming of games, as pursued by Google Stadia and other providers, is also likely to become important, as it would allow the simulation of huge, shared gaming worlds which go way beyond what is possible now with shared-world shooters such as Destiny and MMORPGs such as WoW. Faster prototyping and iteration cycles for artists (no programming needed) will also become more and more important, to compensate for the ever-increasing effort required to create the highly detailed assets of modern games. The question, however, will be whether we will see new and exciting game designs and concepts emerge, or whether it will just be old wine in new bottles.

3.2 The Game Loop

A fundamental difference between games and GUIs is that in the case of a GUI most of the screen content is static; when windows are moved around or things change, only the pixels that have changed are redrawn. Real-time 3D computer graphics are implemented in an entirely different way. As the camera moves about in a 3D scene, the entire contents of the screen or window change continually, so the concept of invalid rectangles no longer applies. Instead, an illusion of motion and interactivity is produced in much the same way that a movie produces it - by presenting the viewer with a series of still images in rapid succession. Obviously, producing a rapid succession of still images on-screen requires a loop. In a real-time rendering application, this is sometimes known as the render loop. At its simplest, a render loop is structured as follows:

while (!quit)
{
  // Update the camera transform based on interactive
  // inputs or by following a predefined path.
  updateCamera();

  // Update positions, orientations and any other
  // relevant visual state of any dynamic elements
  // in the scene.
  updateSceneElements();

  // Render a still frame into an off-screen frame
  // buffer known as the "back buffer".
  renderScene();

  // Swap the back buffer with the front buffer, making
  // the most recently rendered image visible
  // on-screen. (Or, in windowed mode, copy (blit) the
  // back buffer's contents to the front buffer.)
  swapBuffers();
}

A game is composed of many interacting subsystems, including device I/O, rendering, animation, collision detection and resolution, optional rigid body dynamics simulation, multiplayer networking, audio, and the list goes on. Most game engine subsystems require periodic servicing while the game is running. However, the rate at which these subsystems need to be serviced varies from subsystem to subsystem. Animation typically needs to be updated at a rate of 30 or 60 Hz, in synchronization with the rendering subsystem. However, a dynamics (physics) simulation may actually require more frequent updates (e.g., 120 Hz). Higher-level systems, like AI, might only need to be serviced once or twice per second, and they needn’t necessarily be synchronized with the rendering loop at all.
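
As an illustration of these differing rates, the following sketch drives a fixed 120 Hz physics update with a time accumulator while servicing AI only twice per second, all from a single loop. Every function name here is a hypothetical placeholder, not an API from a real engine:

// Hypothetical subsystem hooks and clock.
double ReadRealTimeSeconds(); // e.g. wraps the platform's high-resolution timer
void   UpdatePhysics(double dt);
void   UpdateAI(double dt);
void   UpdateAnimation(double dt);
void   RenderScene();
bool   g_quitRequested = false;

void GameLoop()
{
    const double physicsStep = 1.0 / 120.0; // physics serviced at a fixed 120 Hz
    double physicsAccum = 0.0;              // time "owed" to the physics system
    double aiAccum      = 0.0;

    double previous = ReadRealTimeSeconds();
    while (!g_quitRequested)
    {
        const double now = ReadRealTimeSeconds();
        const double dt  = now - previous;
        previous = now;

        // run physics zero or more times to maintain its fixed update rate
        physicsAccum += dt;
        while (physicsAccum >= physicsStep)
        {
            UpdatePhysics(physicsStep);
            physicsAccum -= physicsStep;
        }

        // AI only needs servicing twice per second
        aiAccum += dt;
        if (aiAccum >= 0.5)
        {
            UpdateAI(aiAccum);
            aiAccum = 0.0;
        }

        UpdateAnimation(dt); // in sync with rendering
        RenderScene();
    }
}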

In the following example, we look at the simplest way to update our engine’s subsystems - using a single loop to update everything. Such a loop is often called the game loop, because it is the master loop that services every subsystem in the engine. As an example we consider pseudocode for the conceptual game loop of the famous game Pong. In Pong, a ball bounces back and forth between two movable vertical paddles and two fixed horizontal walls. The human players control the positions of the paddles via control wheels. If the ball passes by a paddle without striking it, the other team wins the point and the ball is reset for a new round of play. The following pseudocode demonstrates what the game loop of a Pong game might look like at its core:

void main() // Pong
{
    initGame();

    while (true) // game loop
    {
        readHumanInterfaceDevices();

        if (quitButtonPressed())
        {
            break; // exit the game loop
        }

        movePaddles();

        moveBall();

        collideAndBounceBall();

        if (ballImpactedSide(LEFT_PLAYER))
        {
            incrementScore(RIGHT_PLAYER);
            resetBall();
        }
        else if (ballImpactedSide(RIGHT_PLAYER))
        {
            incrementScore(LEFT_PLAYER);
            resetBall();
        }

        renderPlayfield();
    }
}

Game loops can be implemented in a number of different ways, but at their core they usually boil down to one or more simple loops, with various embellishments. This is the very core of game engine architecture and of this chapter, and in the following we discuss a few of the more common architectures.

3.2.1 Windows Message Pumps

On a Windows platform, games need to service messages from the Windows operating system in addition to servicing the various subsystems in the game engine itself. Windows games therefore contain a chunk of code known as a message pump. The basic idea is to service Windows messages whenever they arrive and to service the game engine only when no Windows messages are pending. A message pump typically looks something like this:

while (true)
{
    // Service any and all pending Windows messages.
    MSG msg;
    
    while (PeekMessage(&msg, nullptr, 0, 0, PM_REMOVE))
    {
        TranslateMessage(&msg);
        DispatchMessage(&msg);
    }
    
    // No more Windows messages to process -- run one
    // iteration of our "real" game loop.
    RunOneIterationOfGameLoop();
}

One of the consequences of implementing the game loop like this is that Windows messages take precedence over rendering and simulating the game. As a result, the game will temporarily freeze whenever you resize or drag the game’s window around on the desktop.

3.2.2 Callback-Driven Frameworks

In such a framework-based rendering or game engine, the main game loop has already been written for us, but it is largely empty. The game programmer writes callback functions in order to “fill in” the missing details. The OGRE rendering engine is an example of a library that has been wrapped in a framework. At the lowest level, OGRE provides functions that can be called directly by a game engine programmer. However, OGRE also provides a framework that encapsulates knowledge of how to use the low-level OGRE library effectively. If the programmer chooses to use the OGRE framework, he or she derives a class from Ogre::FrameListener and overrides two virtual functions: frameStarted() and frameEnded(). As you might guess, these functions are called before and after the main 3D scene has been rendered by OGRE, respectively. The OGRE framework’s implementation of its internal game loop looks something like the following pseudocode.

while (true)
{
    for (each frameListener)
    {
        frameListener.frameStarted();
    }
    
    renderCurrentScene();
    
    for (each frameListener)
    {
        frameListener.frameEnded();
    }
    
    finalizeSceneAndSwapBuffers();
}

A particular game’s frame listener implementation might look something like this.

class GameFrameListener : public Ogre::FrameListener
{
public:
    virtual void frameStarted(const FrameEvent& event)
    {
        // Do things that must happen before the 3D scene
        // is rendered (i.e., service all game engine
        // subsystems).
        pollJoypad(event);
        updatePlayerControls(event);
        updateDynamicsSimulation(event);
        resolveCollisions(event);
        updateCamera(event);
        
        // etc.
    }
    
    virtual void frameEnded(const FrameEvent& event)
    {
        // Do things that must happen after the 3D scene
        // has been rendered.
        drawHud(event);
        
        // etc.
    }
};

3.2.3 Event-Based Updating

In games, an event is any interesting change in the state of the game or its environment. Some examples include: the human player pressing a button on the joypad, an explosion going off, an enemy character spotting the player, and the list goes on. Most game engines have an event system, which permits various engine subsystems to register interest in particular kinds of events and to respond to those events when they occur (see Game Objects for details). A game’s event system is usually very similar to the event/messaging system underlying virtually all graphical user interfaces (for example, Microsoft Windows’ window messages or the event handling system in Java’s AWT).

Some game engines leverage their event system in order to implement the periodic servicing of some or all of their subsystems. For this to work, the event system must permit events to be posted into the future—that is, to be queued for later delivery. A game engine can then implement periodic updating by simply posting an event. In the event handler, the code can perform whatever periodic servicing is required. It can then post a new event 1/30 or 1/60 of a second into the future, thus continuing the periodic servicing for as long as it is required.
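
A minimal sketch of this idea, assuming a hypothetical event system built on a time-ordered priority queue (none of these names come from a real engine):

#include <functional>
#include <queue>
#include <vector>

// An event is a callback tagged with the game time at which to deliver it.
struct Event
{
    double deliveryTime;
    std::function<void(double)> handler; // receives the current game time
};

struct LaterFirst
{
    bool operator()(const Event& a, const Event& b) const
    {
        return a.deliveryTime > b.deliveryTime; // earliest event on top
    }
};

std::priority_queue<Event, std::vector<Event>, LaterFirst> g_eventQueue;

void PostEvent(double when, std::function<void(double)> handler)
{
    g_eventQueue.push({ when, std::move(handler) });
}

// A subsystem services itself periodically by re-posting its own event.
void ServiceAI(double now)
{
    // ... run the AI update here ...
    PostEvent(now + 0.5, ServiceAI); // run again half a second from now
}

// Called once per game loop iteration with the current game time.
void DispatchDueEvents(double now)
{
    while (!g_eventQueue.empty() && g_eventQueue.top().deliveryTime <= now)
    {
        Event e = g_eventQueue.top();
        g_eventQueue.pop();
        e.handler(now);
    }
}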

3.3 Multicore Game Loops

In recent years there has been a strong push towards multicore game engines, to best utilise the increasing number of cores available on consumer CPUs. In the following we take a closer look at how game loops can be adapted to multicore CPUs.

3.3.1 Task Decomposition

To take advantage of parallel computing hardware, we need to decompose the various tasks that are performed during each iteration of our game loop into multiple subtasks, each of which can be executed in parallel. This act of decomposition transforms our software from a sequential program into a concurrent one.

There are many ways to decompose a software system for concurrency, but we can place all of them roughly into one of two categories: task parallelism and data parallelism.

  • Task parallelism naturally fits situations in which a number of different things need to be done, and we opt to do them in parallel across multiple cores. For example, we might attempt to perform animation blending in parallel with collision detection during each iteration of our game loop, or we might submit primitives to the GPU for rendering of frame N in parallel with starting to update the state of the game world for frame N + 1.

  • Data parallelism is best suited to situations in which a single computation needs to be performed repeatedly on a large number of data elements. The GPU is probably the best example of data parallelism in action - a GPU performs millions of per-pixel and per-vertex calculations every frame by distributing the work across a large number of processing cores that operate in parallel. However, as we’ll see in the following sections, data parallelism isn’t only found on the GPU—there are plenty of tasks performed by the CPU during the game loop that can benefit from data parallelism as well.

3.3.2 One Thread per Subsystem

One simple way to decompose our game loop for concurrency would be to assign particular engine subsystems to run in separate threads. For example, the rendering engine, the collision and physics simulation, the animation pipeline, and the audio engine could each be assigned to its own thread. A master thread would control and synchronize the operations of these secondary subsystem threads, and would also continue to handle the lion’s share of the game’s high-level logic (the main game loop). On a hardware platform with multiple physical CPUs, this design would allow the threaded engine subsystems to execute in parallel with one another and with the main game loop. This is a simple example of task parallelism, see Figure 3.24.

Figure 3.24: One thread per major engine subsystem (Figure taken from [13]).

There are a number of problems with the simple approach of assigning each engine subsystem to its own thread:

  • For one thing, the number of engine subsystems probably won’t match the number of cores on our gaming platform. As a result, we’ll probably end up with more threads than we have cores, and some subsystem threads will need to share a core via time-slicing.

  • Another problem is that each engine subsystem requires a different amount of processing each frame. This means that while some threads (and their corresponding CPU cores) are highly utilized each frame, others may sit idle for large portions of the frame.

  • Yet another issue is that some engine subsystems depend on the data produced by others. For example, the rendering and audio subsystems cannot start doing their work for frame N until the animation, collision and physics systems have completed their work for frame N. We cannot run two subsystems in parallel if they depend on one another like this.

As a result of all of these problems, attempting to assign each engine subsystem to its own thread is really not a practical design.

3.3.3 Scatter / Gather

Many of the tasks performed during a single iteration of the game loop are data-intensive. For example, we may need to process a large number of ray cast requests, blend together a large batch of animation poses, or calculate world-space matrices for every object in a massive interactive scene. One way to take advantage of parallel computing hardware for these kinds of tasks is to use a divide-and-conquer approach. Instead of attempting to process 9000 ray casts one at a time on a single CPU core, we can divide the work into, say, six batches of 1500 ray casts each, and then execute one batch on each of the six CPU cores available on our PS4 or Xbox One. This approach is a form of data parallelism.

In distributed systems terminology, this is known as a scatter/gather approach, because a unit of work is divided into smaller subunits, distributed onto multiple processing cores for execution (scatter), and then the results are combined or finalized in an appropriate manner once all workloads have been completed (gather).

In the context of a game loop, one or more scatter/gather operations might be performed, at various times during one iteration of the game loop, by the “master” game loop thread, see Figure 3.25.

Figure 3.25: Scatter/gather used to parallelize selected CPU-intensive parts of the game loop (Figure taken from [13]).

Given a dataset containing N data items that require processing, the master thread would divide the work into m batches, each containing roughly N/m elements. The master thread would then spawn m worker threads and provide each with a start index and count, allowing it to process its assigned subset of the data. Each worker thread might update the data items in-place, or (usually better) it might produce output data into a separate preallocated buffer (one per worker thread). Having successfully scattered the workload, the master thread would then be free to perform some other useful work while it waits for the worker threads to complete their tasks.
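
The following is a hedged sketch of this pattern using std::thread; the ray cast types, the per-item function ProcessRayCast() and the batch layout are illustrative assumptions, not a real engine's API:

#include <cstddef>
#include <thread>
#include <vector>

struct RayCastRequest { /* origin, direction, ... */ };
struct RayCastResult  { /* hit point, hit object, ... */ };
RayCastResult ProcessRayCast(const RayCastRequest& req); // assumed elsewhere

// Scatter: split n requests into m batches, one worker thread per batch.
// Gather: join all workers, after which 'results' is fully populated.
void ScatterGatherRayCasts(const std::vector<RayCastRequest>& requests,
                           std::vector<RayCastResult>& results,
                           unsigned m)
{
    const std::size_t n = requests.size();
    results.resize(n);

    std::vector<std::thread> workers;
    for (unsigned w = 0; w < m; ++w)
    {
        // each worker gets a contiguous [start, end) slice of the data
        const std::size_t start = n * w / m;
        const std::size_t end   = n * (w + 1) / m;
        workers.emplace_back([&requests, &results, start, end] {
            for (std::size_t i = start; i < end; ++i)
                results[i] = ProcessRayCast(requests[i]); // one output slot per item
        });
    }

    // gather: wait until every batch has been processed
    for (std::thread& t : workers)
        t.join();
}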

A problem is that spawning threads is expensive, especially in game engines, as it involves a kernel call into the OS. This problem can be mitigated by using a thread pool. However, even if the thread pool solves the performance problem, a scatter/gather approach is not very flexible. Therefore, what modern game engines implement instead is a general-purpose system for executing units of work concurrently, across the available cores of the target hardware.

3.3.4 Job Systems

A job system is a general-purpose system for executing arbitrary units of work, typically called jobs, across multiple cores. With a job system, a game programmer can subdivide each iteration of the game loop into a relatively large number of independent jobs, and submit them to the job system for execution. The job system maintains a queue of submitted jobs, and it schedules those jobs across the available cores, either by submitting them to be executed by worker threads in a thread pool, or by some other means. In a sense, a job system is like a custom, light-weight operating system kernel, except that instead of scheduling threads to run on the available cores, it schedules jobs. Jobs can be arbitrarily fine-grained, and in a real game engine many are independent of one another. As shown in Figure 3.26, these facts help to maximize processor utilization. This architecture also scales up or down naturally to hardware with any number of CPU cores.

Figure 3.26: In a job model, work is broken down into fine-grained chunks that can be picked up by any available processor. This can help maximize processor utilization while providing the main game loop with improved flexibility (Figure taken from [13]).

A typical job system provides a simple, easy-to-use API that looks very similar to the API for a threading library:

namespace job
{
    // signature of all job entry points
    typedef void EntryPoint(uintptr_t param);
    
    // allowable priorities
    enum class Priority
    {
        LOW, NORMAL, HIGH, CRITICAL
    };
    
    // counter (implementation not shown)
    struct Counter ... ;
    Counter* AllocCounter();
    void FreeCounter(Counter* pCounter);
    
    // simple job declaration
    struct Declaration
    {
        EntryPoint*     m_pEntryPoint;
        uintptr_t       m_param;
        Priority        m_priority;
        Counter*        m_pCounter;
    };
    
    // kick (start) a job
    void KickJob(const Declaration& decl);
    void KickJobs(int count, const Declaration aDecl[]);
    
    // wait for job to terminate (for its Counter to become zero)
    void WaitForCounter(Counter* pCounter);
    
    // kick jobs and wait for completion
    void KickJobAndWait(const Declaration& decl);
    void KickJobsAndWait(int count, const Declaration aDecl[]);
}
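
A hypothetical usage sketch of this API, kicking one job per particle system and waiting on a shared counter (the ParticleSystem type and its Update() method are assumed purely for illustration):

#include <cstdint>
#include <vector>

// assumed game type, for illustration only
struct ParticleSystem { void Update(); };

// job entry point: param carries a pointer to the data to process
static void UpdateParticlesJob(uintptr_t param)
{
    reinterpret_cast<ParticleSystem*>(param)->Update();
}

void UpdateAllParticleSystems(ParticleSystem* aSystems, int count)
{
    job::Counter* pCounter = job::AllocCounter();

    std::vector<job::Declaration> decls(count);
    for (int i = 0; i < count; ++i)
    {
        decls[i].m_pEntryPoint = &UpdateParticlesJob;
        decls[i].m_param       = reinterpret_cast<uintptr_t>(&aSystems[i]);
        decls[i].m_priority    = job::Priority::NORMAL;
        decls[i].m_pCounter    = pCounter;
    }

    job::KickJobs(count, decls.data());
    job::WaitForCounter(pCounter); // returns once all jobs have run
    job::FreeCounter(pCounter);
}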

Such a job system can be built around a pool of worker threads, where it is a good idea to spawn one worker thread for each CPU core present in the target machine and to use affinity to lock each thread to one core. A worker thread might look like this:

namespace job
{
    // Assumes shared state defined elsewhere: a queue of pending job
    // Declarations, a mutex g_mutex protecting it, a condition variable
    // g_jobCv and a flag g_ready signalling that a job is available.
    void* JobWorkerThread(void*)
    {
        // keep on running jobs forever...
        while (true)
        {
            Declaration declCopy;
            
            // wait for a job to become available
            pthread_mutex_lock(&g_mutex);
            while (!g_ready)
            {
                pthread_cond_wait(&g_jobCv, &g_mutex);
            }
            
            // copy the JobDeclaration locally and
            // release our mutex lock
            declCopy = GetNextJobFromQueue();
            pthread_mutex_unlock(&g_mutex);
            
            // run the job
            declCopy.m_pEntryPoint(declCopy.m_param);
            
            // job is done! rinse and repeat...
        }
    }
}

Such a thread pool-based job system has a limitation. Consider the following job that updates the state of an AI-driven NPC:

void NpcThinkJob(uintptr_t param)
{
    Npc* pNpc = reinterpret_cast<Npc*>(param);
    
    pNpc->StartThinking();
    pNpc->DoSomeMoreUpdating();
    
    // ...
    // now let's cast a ray to see if we're aiming
    // at anything interesting -- this involves
    // kicking off another job that will run on
    // a different core (worker thread)
    RayCastHandle hRayCast = CastGunAimRay(pNpc);
    
    // the results of the ray cast aren't going to
    // be ready until later this frame, so let's
    // go to sleep until it's ready
    WaitForRayCast(hRayCast);
    
    // zzz...
    
    // wake up!
    
    // now fire my weapon, but only if the ray
    // cast indicates that we are aiming at an
    // enemy
    pNpc->TryFireWeaponAtTarget(hRayCast);
    
    // ...
}

Unfortunately this kind of job wouldn’t work if we tried to run it within the simple job system that we described in the previous section. The problem is the call to WaitForRayCast(). In our simple thread-pool-based job system, every job must run to completion once it starts running. It cannot “go to sleep” waiting for the results of the ray cast, allowing other jobs to run on the worker thread, and then “wake up” later, when the ray cast results are ready.

This limitation arises because in our simple system, each running job shares the same call stack as the worker thread that invoked it. To put a job to sleep, we would need to effectively context switch to another job. That would involve saving off the outgoing job’s call stack and registers, and then overwriting the worker thread’s call stack with the incoming job’s call stack. There’s no simple way to do that when using such a simple job execution method. There are various solutions to this problem:

  • Jobs as Coroutines make it possible to execute code asynchronously by allowing multiple entry points for suspending and resuming execution at certain points. When calling another coroutine, the calling coroutine yields to it, and it returns and continues from where it left off at some later time, when another coroutine yields control back to it.

  • Jobs as Fibers allow for a similar approach to coroutines, but one that is more flexible and explicit in its yielding. A fiber (also known as a green thread, protothread or user-level thread) can be understood as a lightweight thread that uses cooperative instead of preemptive multitasking. A fiber therefore never gets interrupted and must explicitly yield to another fiber to run. The benefit of fibers is that they can migrate from one thread to another, making them well suited for implementing a job system.

If we implement our job system with coroutines or fibers, we have the ability to put a job to sleep (saving off its execution context) and to wake it back up at a future time (thus restoring its execution context). This in turn permits us to implement a join function for our job system - a function that causes the calling job to go to sleep, waiting until one or more other jobs have completed executing. To synchronise on jobs, the job counter concept is used: whenever a job is kicked, it can optionally be associated with a counter, which can then be synchronised on using a spin wait, making it possible to wait for a whole batch of jobs. This is quite an efficient approach, as it involves just checking whether a counter is zero instead of checking individual jobs; a hypothetical sketch follows below.
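
Since the Counter implementation was not shown above, the following is a purely hypothetical sketch of this mechanism: the counter is an atomic integer, incremented when a job is kicked and decremented when a job completes, and the join spins until it reaches zero.

#include <atomic>

namespace job
{
    // Hypothetical counter layout: incremented in KickJob(s),
    // decremented by the worker thread when a job completes.
    struct Counter
    {
        std::atomic<int> value { 0 };
    };

    void WaitForCounter(Counter* pCounter)
    {
        // spin until every associated job has finished; a fiber-based
        // system would instead yield this fiber onto a wait list
        while (pCounter->value.load(std::memory_order_acquire) != 0)
        {
            // busy-wait
        }
    }
}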

3.3.5 Naughty Dog’s Job System

The job system used by Naughty Dog in The Last of Us: Remastered, Uncharted 4: A Thief’s End and Uncharted: The Lost Legacy largely follows the design of the hypothetical job system we have discussed so far. See Figures 3.27 and 3.28 for screenshots of games running on this job system.

Figure 3.27: Uncharted 4: A Thief’s End (2016) (Image courtesy of Naughty Dog).

Figure 3.28: The Last of Us: Remastered (2014) (Image courtesy of Naughty Dog).

It is based on fibers (rather than thread pools or coroutines). It makes use of spin locks and also provides a special job mutex that can put a job to sleep while it waits for a lock. It uses counters rather than job handles to implement a join operation.

When the system first boots up, the main thread converts itself to a fiber in order to enable fibers within the process as a whole. Next, job worker threads are spawned, one for each of the six cores available to developers on the PS4 (there are 8 cores, however the 7th is essentially reserved for the OS, and the 8th is a ‘backup’ which is off limits and there in case one of the cores is faulty from production). These threads are each locked to one core via their CPU affinity settings, so we can think of these worker threads and their cores as being roughly synonymous.

Fiber creation is slow on the PS4, so a pool of fibers is pre-spawned, along with memory blocks to serve as each fiber’s call stack. When jobs are kicked, their declarations are placed onto a queue. As cores/worker threads become free (as jobs terminate), new jobs are pulled from this queue and executed. A running job can also add more jobs to the job queue.

To execute a job, an unused fiber is pulled from the fiber pool, and the worker thread performs a switch to the fiber to start the job running. When a job waits on a counter, the job is put to sleep and its fiber (execution context) is placed on a wait list, along with the counter it is waiting for. When this counter hits zero, the job is woken back up so it can continue where it left off.

The system has 160 fibers, which means that up to 160 jobs can be active (started but not yet completed, not necessarily physically running). Further, there are 3 global job queues (one each for low/normal/high priority), with the requirement that if any thread enqueues a high-priority job, it should be the next job to run. There is no job stealing, thus sacrificing throughput in favour of latency, see Figure 3.29.

Figure 3.29: Naughty Dog’s job system.

Figure 3.30: If a fiber is executing, it pulls a job from the job queue; jobs therefore always execute inside a fiber context.

Figure 3.31: Jobs can add more jobs to the three global job queues; new jobs are always pushed to the end.

Figure 3.32: If a job runs into a condition where it needs to wait for a synchronisation counter which does not yet have the value it expects, the fiber is moved to a wait list associated with the counter being waited on. In this case the yellow job wants the counter to become 0.

Figure 3.33: Now a new fiber is used to allow another job to begin executing. The stack of the waiting job is preserved in the fiber.

Figure 3.34: The new fiber takes the next (blue) job from the job queue and begins executing it.

Figure 3.35: When a job completes, it decrements the optionally associated synchronisation counter and wakes up any waiting jobs. In this case the yellow job was waiting on the blue job, creating a dependency. The counter gets decremented, and the yellow job is now allowed to run.

Figure 3.36: When a job is resumed, the worker thread needs to resume an existing call stack by swapping in the fiber from the wait list (basically just swapping registers and the call stack).

Figure 3.37: The thread then switches to the waiting fiber and resumes execution.

The worker threads are locked to cores to avoid context switches and unwanted core migration of threads, which can cause unwanted hiccups. The problem is as follows: all 6 worker threads locked to the cores are purely computational and therefore run for a very long time uninterrupted. If a kernel (OS) thread is interrupted and needs to run again, it will interrupt one of the 6 worker threads, as they have been running for a very long time already, causing a hiccup in the whole computational flow of the game.

The job system executes around 800-1000 jobs per frame in The Last of Us: Remastered. Further, everything is a job: game object updates, animation updates, ray casts, command buffer generation. The exceptions are I/O threads (sockets, file I/O, system calls), which involve blocking calls, where the kernel blocks the calling thread and puts it to sleep. These are handled by system threads which sleep most of the time, waiting for their I/O event. When the I/O event occurs, the thread wakes up, puts the data into some global buffer and kicks off a job on the job system to process the data.

This job system allowed the developers to turn synchronous work into asynchronous work. Consider the animation of 100 objects, implemented in their pre-job-system engine as follows:

Object g_Objects[100];

void AnimateAllObjects()
{
    for (int objIndex = 0; objIndex < 100; ++objIndex)
    {
        AnimateObject(&g_Objects[objIndex]);
    }
}

void AnimateObject(void* pObj)
{
  ...
}

In the new job system, this can be achieved using 100 jobs:

Object g_Objects[100];

void AnimateAllObjects()
{
    // job declarations, created on the stack!
    JobDecl jobDecls[100];
    for (int jobIndex = 0; jobIndex < 100; ++jobIndex)
    {
        // define one job per object
        jobDecls[jobIndex] = JobDecl(AnimateObjectJob, &g_Objects[jobIndex]);
    }

    Counter* pJobCounter = NULL;

    // run the jobs on the job system
    RunJobs(jobDecls, 100, &pJobCounter);

    // wait for the jobs to finish
    WaitForCounterAndFree(pJobCounter, 0);
}

JOB_ENTRY_POINT(AnimateObjectJob)
{
    ...
}

Benefits of their new job system:

  • Extremely easy to jobify existing gameplay updates; it took the team only a few weeks.
  • Easy to synchronise (WaitForCounter).
  • Lightweight due to the use of fibers.

Drawbacks of their new job system:

  • It is not possible to use any OS synchronisation primitives (mutexes, semaphores,…), as they would block the whole worker thread and not only the current fiber (because fiber multitasking is cooperative).
  • Synchronisation is done at the hardware level, where atomic spin locks are used almost everywhere. If a lock is held for longer, a special job mutex is used, which puts the current job to sleep instead of spinning; a sketch of such a spin lock follows below.
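
As a minimal sketch (an illustration, not Naughty Dog's actual implementation), such an atomic spin lock can be built on std::atomic_flag:

#include <atomic>

// Busy-waits instead of blocking, so briefly held locks never force
// the worker thread into the kernel (which would stall its fibers).
class SpinLock
{
public:
    void lock()
    {
        while (m_flag.test_and_set(std::memory_order_acquire))
        {
            // spin; if the lock is held for longer, a job mutex that
            // puts the job to sleep would be used instead
        }
    }

    void unlock()
    {
        m_flag.clear(std::memory_order_release);
    }

private:
    std::atomic_flag m_flag = ATOMIC_FLAG_INIT;
};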

The critical path of such a job system defines a lower bound on the frame time. Despite a lot of optimisation and clever jobifying, 2 months before shipping The Last of Us: Remastered Naughty Dog ran into a critical path of 25 ms, which corresponds to around 40 FPS and is still far away from the target of 60 FPS on the PS4. They were ultimately CPU bound; however, they realised that there was a lot of idle time on the CPU cores, see Figure 3.38.

Figure 3.38: Idle cores in Naughty Dog’s job system.

At this point their engine ran game logic first and then the rendering logic, creating a clear separation, see Figure 3.39.

Figure 3.39: Naughty Dog’s game engine ran game logic first, then rendering logic. Green color indicates game logic, red color indicates rendering logic. Green background marks the area where the time per frame is below 16.66 ms (at least 60 FPS), blue background marks the area where the time per frame is above 16.66 ms (below 60 FPS).

They could show that it was theoretically possible to run at 60 FPS: ~100 ms of work was done on the CPU for a single frame, and split across 6 worker cores this amounts to 16.66 ms per core, which is exactly 60 FPS. The solution was to run game and render logic at the same time, but processing different frames. As a consequence, Naughty Dog defined a frame as a piece of data that is processed and ultimately displayed on screen. The emphasis is on piece of data: the definition is completely free of any measure of time. Frames per second is therefore a measure of throughput, not of time.

Figure 3.40: Processing game logic of frame 1.

Figure 3.41: When frame 1 game logic has finished, it is passed on to the rendering logic, and game logic for frame 2 is computed.

Figure 3.42: Finally, frame 1 is executed on the GPU, while the game logic of a 3rd frame is computed and the 2nd frame is in the rendering logic.

This new design led to reduced complexity in the engine, with very few locks needed despite the parallelism. Locks are now only used to synchronise within heavily jobified stage updates instead of for synchronisation between stages. The new design resulted in the utilisation seen in Figure 3.43, with a critical path of 15.5 ms, running at 64 FPS on average.

Figure 3.43: The core utilisation after the redesign. Green color indicates game logic, red color indicates rendering logic. Green background marks the area where the time per frame is below 16.66 ms (at least 60 FPS), blue background marks the area where the time per frame is above 16.66 ms (below 60 FPS).

The details of implementing such an interleaved engine are quite complex, boiling down to the question of how to manage memory when there are 3 frames in the pipeline. Basically, each frame receives an exclusive copy of some frame-specific data, upon which the subsystems can operate for this frame, arriving at a truly parallel solution. However, memory management in game engines is highly involved, as due to its performance implications dynamic memory allocation (anything like new or malloc) is avoided at all costs. We will not go into details of this but refer interested students to [13].

3.4 Time

Games are real-time, dynamic, interactive computer simulations. As such, time plays an incredibly important role in any electronic game. There are many different kinds of time to deal with in a game engine - real time, game time, the local timeline of an animation, the actual CPU cycles spent within a particular function, and the list goes on. Every engine system might define and manipulate time differently. We must have a solid understanding of all the ways time can be used in a game. This problem goes to the very heart of game engine architectures and in the following, we’ll take a look at how real-time, dynamic simulation software works and explore the common ways in which time plays a role in such a simulation.

In game programming, it can be extremely useful to think in terms of abstract timelines. A timeline is a continuous, one-dimensional axis whose origin (t = 0) can lie at any arbitrary location relative to other timelines in the system. A timeline can be implemented via a simple clock variable that stores absolute time values in either integer or floating-point format.

3.4.1 Real Time

We can think of times measured directly via the CPU’s high-resolution timer register as lying on what we’ll call the real timeline. The origin of this timeline is defined to coincide with the moment the CPU was last powered on or reset. It measures times in units of CPU cycles (or some multiple thereof), although these time values can be easily converted into units of seconds by dividing them by the frequency of the high-resolution timer on the current CPU.
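
As a small illustration, the following is a minimal sketch of reading such a timer and converting ticks to seconds. It uses std::chrono::steady_clock as a portable stand-in for the CPU’s high-resolution timer register; the function names are otherwise illustrative.

#include <chrono>
#include <cstdint>

// Read the real timeline. steady_clock stands in for the CPU's
// high-resolution timer register.
uint64_t ReadHiResTimer()
{
    return static_cast<uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
}

// Convert timer ticks to seconds by dividing by the timer frequency.
// steady_clock::period is the tick duration num/den in seconds, so the
// frequency is den/num ticks per second.
double TicksToSeconds(uint64_t ticks)
{
    using period = std::chrono::steady_clock::period;
    return static_cast<double>(ticks) * period::num / period::den;
}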

3.4.2 Game Time

We needn’t limit ourselves to working with the real timeline exclusively. We can define as many other timelines as we need in order to solve the problems at hand. For example, we can define a game timeline that is technically independent of real time. Under normal circumstances, game time coincides with real time. If we wish to pause the game, we can simply stop updating the game timeline temporarily. If we want our game to go into slow motion, we can update the game clock more slowly than the real-time clock. All sorts of effects can be achieved by scaling and warping one timeline relative to another.
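
A minimal sketch of such a game clock is shown below; it assumes the real-time delta is measured elsewhere and passed in once per iteration of the game loop.

class GameClock
{
    double m_gameTime;  // absolute game time in seconds
    float  m_timeScale; // 1 = normal speed, 0.5 = slow motion, etc.
    bool   m_isPaused;

public:
    GameClock() : m_gameTime(0.0), m_timeScale(1.0f), m_isPaused(false) {}

    void SetTimeScale(float scale) { m_timeScale = scale; }
    void SetPaused(bool paused)    { m_isPaused = paused; }
    double GetGameTime() const     { return m_gameTime; }

    // Advance the game timeline by a scaled version of the measured
    // real-time delta; a paused clock simply stops advancing.
    void Update(float realDt)
    {
        if (!m_isPaused)
        {
            m_gameTime += realDt * m_timeScale;
        }
    }
};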

3.4.3 Local and Global Timelines

We can envision all sorts of other timelines. For example, an animation clip or audio clip might have a local timeline, with its origin (t = 0) defined to coincide with the start of the clip. The local timeline measures how time progressed when the clip was originally authored or recorded. When the clip is played back in-game, we needn’t play it at the original rate. We might want to speed up an animation, or slow down an audio sample. We can even play an animation backwards by running its local clock in reverse.

Any one of these effects can be visualized as a mapping between the local timeline and a global timeline, such as real time or game time. To play an animation clip back at its originally authored speed, we simply map the start of the animation’s local timeline (t = 0) onto the desired start time (\(\tau = \tau_{start}\)) along the global timeline, see Figure 3.44.

Figure 3.44: Playing an animation clip can be visualized as mapping its local timeline onto the global game timeline (Figure taken from [13]).

To play an animation clip back at half speed, we can imagine scaling the local timeline to twice its original size prior to mapping it onto the global timeline. To accomplish this, we simply keep track of a time scale factor or playback rate R, in addition to the clip’s global start time \(\tau_{start}\), see Figure 3.45.

Figure 3.45: Animation playback speed can be controlled by simply scaling the local timeline prior to mapping it onto the global timeline (Figure taken from [13]).

A clip can even be played in reverse, by using a negative time scale (R < 0), see Figure 3.46.

Figure 3.46: Playing an animation in reverse is like mapping the clip to the global timeline with a time scale of R = −1 (Figure taken from [13]).
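
Putting Figures 3.44–3.46 into code, the mapping from the global timeline to a clip’s local timeline can be sketched as follows. The struct and its member names are illustrative, and for reverse playback we assume the convention that the local clock starts at the end of the clip.

struct ClipPlayback
{
    float m_globalStart; // tau_start: global start time of the clip
    float m_rate;        // R: 1 = normal, 0.5 = half speed, -1 = reverse
    float m_duration;    // length of the clip on its local timeline

    // Map a global time tau onto the clip's local timeline:
    // t = (tau - tau_start) * R.
    float LocalTime(float tau) const
    {
        float t = (tau - m_globalStart) * m_rate;

        // For reverse playback, offset by the clip duration so that
        // the local clock runs from m_duration down to 0.
        if (m_rate < 0.0f)
        {
            t += m_duration;
        }
        return t;
    }
};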

3.4.4 Measuring Time

The frame rate of a real-time game describes how rapidly the sequence of still 3D frames is presented to the viewer. The unit of Hertz (Hz), defined as the number of cycles per second, can be used to describe the rate of any periodic process. In games and film, frame rate is typically measured in frames per second (FPS), which is the same thing as Hertz for all intents and purposes. Films traditionally run at 24 FPS. Modern games are typically targeted at rendering at 30 or 60 FPS.

The amount of time that elapses between frames is known as the frame time, time delta or delta time. This last term is commonplace because the duration between frames is often represented mathematically by the symbol \(\Delta t\). If a game is being rendered at exactly 30 FPS, then its delta time is 1/30 of a second, or 33.3 ms (milliseconds). At 60 FPS, the delta time is half as big, 1/60 of a second or 16.6 ms. To really know how much time has elapsed during one iteration of the game loop, we need to measure it. Milliseconds are a common unit of time measurement in games. For example, we might say that the animation system takes 4 ms to run, which implies that it occupies about 12% of the entire frame (4/33.3 ≈ 0.12). Other common units include seconds and machine cycles.

3.4.5 From Frame Rate to Speed

Imagine that we want to make a spaceship fly through our game world at a constant speed of 40 meters per second (or in a 2D game, we might specify this as 40 pixels per second). One simple way to accomplish this is to multiply the ship’s speed v (measured in meters per second) by the duration of one frame ∆t (measured in seconds), yielding a change in position \(\Delta x = v \Delta t\) (which is measured in meters per frame). This position delta can then be added to the ship’s current position \(x_1\), in order to find its position next frame: \(x_2 = x_1 + \Delta x = x_1 + v \Delta t\).

Early games ignored \(\Delta t\) altogether and instead specified the speeds of objects directly in terms of meters per frame. In other words, they were specifying object speeds in terms of \(\Delta x = v \Delta t\) instead of in terms of v. The consequence was that the perceived speeds of the objects in these games were entirely dependent upon the frame rate the game was actually achieving on a particular piece of hardware. On the expected hardware everything was fine, but on faster hardware the game ran faster, eventually becoming unplayable on modern machines.

Computing the new position by adding the velocity multiplied by \(\Delta t\) to the current position is actually a simple form of numerical integration known as the explicit Euler method. It works well as long as the speeds of our objects are roughly constant. To handle variable speeds, we need to resort to somewhat more complex integration methods. But all numerical integration techniques make use of the elapsed frame time \(\Delta t\) in one way or another. So it is safe to say that the perceived speeds of the objects in a game are dependent upon the frame duration, \(\Delta t\). Hence a central problem in game programming is to determine a suitable value for \(\Delta t\).
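
A minimal sketch of this explicit Euler step for the spaceship example, assuming a bare-bones Vector3 math type:

struct Vector3
{
    float x, y, z;
};

struct Ship
{
    Vector3 m_position; // meters
    Vector3 m_velocity; // meters per second

    // Explicit Euler step: x2 = x1 + v * dt.
    void Integrate(float dt)
    {
        m_position.x += m_velocity.x * dt;
        m_position.y += m_velocity.y * dt;
        m_position.z += m_velocity.z * dt;
    }
};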

3.4.6 Determining \(\Delta t\)

To make games CPU-independent, \(\Delta t\) must be measured in some way, rather than simply ignoring it, as old games did. Measuring \(\Delta t\) is quite straightforward. We simply read the value of the CPU’s high-resolution timer twice—once at the beginning of the frame and once at the end. Then we subtract, producing an accurate measure of \(\Delta t\) for the frame that has just passed. This delta is then made available to all engine subsystems that need it, either by passing it to every function that we call from within the game loop or by storing it in a global variable or encapsulating it within a singleton class of some kind.
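
In code, measuring \(\Delta t\) boils down to differencing two successive reads of the clock. A minimal sketch, again using std::chrono as a portable stand-in for the high-resolution timer:

#include <chrono>

int main()
{
    using Clock = std::chrono::steady_clock;
    Clock::time_point tBegin = Clock::now();

    while (true) // main game loop
    {
        // ... update and render the frame ...

        // Read the timer again and compute this frame's duration.
        Clock::time_point tEnd = Clock::now();
        float dt = std::chrono::duration<float>(tEnd - tBegin).count();
        tBegin = tEnd;

        // dt is now available to all subsystems during the next frame.
        (void)dt;
    }
}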

Although this technique is used by most game engines, it has one big problem. We are using the measured value of \(\Delta t\) taken during frame k as an estimate of the duration of the upcoming frame (\(k + 1\)). This isn’t necessarily very accurate. Something might happen next frame that causes it to take much more time (or much less) than the current frame. We call such an event a frame-rate spike.

Using last frame’s delta as an estimate of the upcoming frame can have some very real detrimental effects. For example, if we’re not careful it can put the game into a “vicious cycle” of poor frame times. Let’s assume that our physics simulation is most stable when updated once every 33.3 ms (i.e., at 30 Hz). If we get one bad frame, taking say 57 ms, then we might make the mistake of stepping the physics system twice on the next frame, presumably to “cover” the 57 ms that has passed. Those two steps take roughly twice as long to complete as a regular step, causing the next frame to be at least as bad as this one was, and possibly worse. This only serves to exacerbate and prolong the problem.

3.4.6.1 Running Average

It is true that game loops tend to have at least some frame-to-frame coherency. If the camera is pointed down a hallway containing lots of expensive-to-draw objects on one frame, there’s a good chance it will still be pointed down that hallway on the next. Therefore, one reasonable approach is to average the frame-time measurements over a small number of frames and use that as the next frame’s estimate of ∆t. This allows the game to adapt to a varying frame rate, while softening the effects of momentary performance spikes. The longer the averaging interval, the less responsive the game will be to a varying frame rate, but spikes will have less of an impact as well.
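
A minimal sketch of such a running average, implemented as a ring buffer over the last N frame-time samples:

#include <cstddef>

template <std::size_t N>
class FrameTimeAverage
{
    float       m_samples[N] = {};
    std::size_t m_next  = 0;
    std::size_t m_count = 0;

public:
    // Record the measured duration of the frame that just ended.
    void AddSample(float dt)
    {
        m_samples[m_next] = dt;
        m_next = (m_next + 1) % N;
        if (m_count < N)
        {
            ++m_count;
        }
    }

    // Estimate of the upcoming frame's dt: the mean of the samples.
    float Estimate() const
    {
        float sum = 0.0f;
        for (std::size_t i = 0; i < m_count; ++i)
        {
            sum += m_samples[i];
        }
        return (m_count > 0) ? sum / m_count : 0.0f;
    }
};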

3.4.6.2 Governing the Frame Rate

Rather than trying to guess at what next frame’s duration will be, we can instead attempt to guarantee that every frame’s duration will be exactly 33.3 ms (or 16.6 ms if we’re running at 60 FPS). To do this, we measure the duration of the current frame as before. If the measured duration is less than the ideal frame time, we simply put the main thread to sleep until the target frame time has elapsed. If the measured duration is more than the ideal frame time, we must “take our lumps” and wait for one more whole frame time to elapse. This is called frame-rate governing.
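
A minimal sketch of frame-rate governing for a 30 FPS target; the actual frame work is elided:

#include <chrono>
#include <thread>

void RunGovernedLoop()
{
    using Clock = std::chrono::steady_clock;
    const std::chrono::duration<double> targetFrameTime(1.0 / 30.0);

    while (true) // main game loop
    {
        Clock::time_point frameStart = Clock::now();

        // ... update and render the frame ...

        Clock::duration elapsed = Clock::now() - frameStart;
        if (elapsed < targetFrameTime)
        {
            // Finished early: sleep away the remainder of the frame.
            std::this_thread::sleep_for(targetFrameTime - elapsed);
        }
        // If we overshot, we "take our lumps": the next frame simply
        // starts after one more whole frame time has elapsed.
    }
}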

Keeping the frame rate consistent can be important for a number of reasons. Some engine systems, such as the numerical integrators used in a physics simulation, operate best when updated at a constant rate.

3.5 Game Objects

So far we have established that a game engine is a complex, layered software system built on top of the hardware, drivers and operating system of the target machine. We have shown how subsystems conceptually interact with each other through the game loop; however, this alone does not make a game. What is needed is a representation of the gameplay mechanics, which are usually defined as a set of rules that govern the interactions between the various entities in the game. More specifically, they define:

  • The objectives of the player(s).
  • The criteria for success and failure.
  • The abilities of the player’s characters.
  • The number and types of non-player characters (NPCs).
  • The overall flow of the gaming experience as a whole.

In general, game engines divide their game world into static and dynamic elements, see Figure 3.47.

Figure 3.47: A game world from Uncharted: The Lost Legacy (© 2017/TM SIE. Created and developed by Naughty Dog, PlayStation 4) showing static and dynamic elements (Figure taken from [13]).

  • Static elements are the unchanging parts of the game world, made out of one giant triangle mesh or, when a game takes place in a very large virtual world, broken up into discrete pieces called world chunks. The player can usually see only a handful of chunks at any given moment while playing the game, see Figure 3.48.

Figure 3.48: Many game worlds are divided into chunks for various reasons, including memory limitations, the need to control the flow of the game through the world, and as a division-of-labor mechanism during development (Figure taken from [13]).

  • Dynamic elements are the elements that change over time. The term game state refers to the current state of all dynamic game world elements taken as a whole. Examples of dynamic elements are characters, vehicles, floating power-ups and so on. Most 3D games consist of a relatively small number of dynamic elements moving about within a relatively large static background area. The dynamic elements of a game are usually more expensive than the static elements in terms of CPU resources, so most 3D games are constrained to a limited number of dynamic elements.

The dynamic elements of a game are usually designed in an object-oriented fashion. We’ll use the term game object (GO) to refer to virtually any dynamic element within a game world. However, this terminology is by no means standard within the industry. Game objects are commonly referred to as entities, actors or agents, and the list of terms goes on. Game engines provide so-called game object models to describe the facilities provided by a game engine in order to permit the dynamic entities in the virtual game world to be modeled and simulated:

  • A game’s object model is a specific object-oriented programming interface intended to solve the particular problem of simulating the specific set of entities that make up a particular game.

  • Additionally, a game’s object model often extends the programming language in which the engine was written. If the game is implemented in a non-object-oriented language like C, object-oriented facilities can be added by the programmers. And even if the game is written in an object-oriented language like C++, advanced features like reflection, persistence and network replication are often added. A game object model sometimes melds the features of multiple languages. For example, a game engine might combine a compiled programming language such as C or C++ with a scripting language like Python, Lua or Pawn and provide a unified object model that can be accessed from either language.

3.5.1 Runtime Game Object Model

The runtime game object model is an implementation of the abstract game object model advertised to the game designers via a world editor. It typically provides most of the following features:

  • Spawning and destroying game objects dynamically. The dynamic elements in a game world often need to come and go during gameplay.

  • Linkage to low-level engine systems. Every game object has some kind of linkage to one or more underlying engine systems. Most game objects are visually represented by renderable triangle meshes. Some have particle effects. Many generate sounds. Some animate. Many have collision, and some are dynamically simulated by the physics engine. One of the primary responsibilities of the gameplay foundation system is to ensure that every game object has access to the services of the engine systems upon which it depends.

  • Real-time simulation of object behaviors. At its core, a game engine is a real-time dynamic computer simulation of an agent-based model. This means that the game engine needs to update the states of all the game objects dynamically over time. The objects may need to be updated in a very particular order, dictated in part by dependencies between the objects, in part by their dependencies on various engine subsystems, and in part because of the interdependencies between those engine subsystems themselves.

  • Finite state machine support. Many types of game objects are best modeled as finite state machines (FSM). Some game engines provide the ability for a game object to exist in one of many possible states, each with its own attributes and behavioral characteristics.

There are also a number of other features such as unique object IDs, game object queries, game object references, network replication and object persistence. We omit them here as they are beyond the focus of this course and refer interested students to [13].

In the following we focus on object-centric, object-oriented runtime game object models, as they are most common in game engine architectures. There are also property-centric models, which we ignore here as they are beyond the scope of this course. Also, we think that in the domain of games, object-oriented programming is superior, as computer simulations (which games ultimately are) are one of the two prime use cases for object-oriented programming.⁹

It’s natural to want to classify game object types taxonomically. This tends to lead game programmers toward an object-oriented language that supports inheritance. A class hierarchy is the most intuitive and straightforward way to represent a collection of interrelated game object types. So it is not surprising that the majority of commercial game engines employ a class hierarchy based technique. Figure 3.49 shows a simple class hierarchy that could be used to implement the game Pac-Man.

Figure 3.49: A hypothetical class hierarchy for the game Pac-Man (Figure taken from [13]).

This is just a hypothetical example, but it illustrates the basic ideas that underlie most game object class hierarchies—namely that common, generic functionality tends to exist at the root of the hierarchy, while classes toward the leaves of the hierarchy tend to add increasingly specific functionality. Another example of a class hierarchy is shown in Figure 3.50, which shows the class hierarchy of the Unreal engine.

Figure 3.50: An excerpt from the game object class hierarchy in the Unreal engine (Figure taken from [13]).

Object-oriented design applies here equally as in other areas; however, care must be taken to avoid an overly idealistic design, as indirections (delegation, composition) and virtual method calls can lead to reduced performance. One possible implementation of the component creation and destruction logic for this kind of hierarchy is shown below:

class GameObject
{
protected:
    // My transform (position, rotation, scale).
    Transform m_transform;
    
    // Standard components:
    MeshInstance* m_pMeshInst;
    AnimationController* m_pAnimController;
    RigidBody* m_pRigidBody;

public:
    GameObject()
    {
        // Assume no components by default.
        // Derived classes will override.
        m_pMeshInst = nullptr;
        m_pAnimController = nullptr;
        m_pRigidBody = nullptr;
    }

    // Virtual, so that deleting a derived object through a
    // GameObject pointer runs the derived destructor too.
    virtual ~GameObject()
    {
        // Automatically delete any components created by
        // derived classes. (Deleting null pointers OK.)
        delete m_pMeshInst;
        delete m_pAnimController;
        delete m_pRigidBody;
    }
    
    // ...
};
    
class Vehicle : public GameObject
{
protected:
    // Add some more components specific to Vehicles...
    Chassis* m_pChassis;
    Engine* m_pEngine;
    // ...

public:
    Vehicle()
    {
        // Construct standard GameObject components.
        m_pMeshInst = new MeshInstance;
        m_pRigidBody = new RigidBody;
        
        // NOTE: We'll assume the animation controller
        // must be provided with a reference to the mesh
        // instance so that it can provide it with a
        // matrix palette.
        m_pAnimController = new AnimationController(*m_pMeshInst);
        
        // Construct vehicle-specific components.
        m_pChassis = new Chassis(*this, *m_pAnimController);
        m_pEngine = new Engine(*this);
    }
    ~Vehicle()
    {
        // Only need to destroy vehicle-specific
        // components, as GameObject cleans up the
        // standard components for us.
        delete m_pChassis;
        delete m_pEngine;
    }
};

3.6 Real-Time Object Model Updating

Every game engine, from the simplest to the most complex, requires some means of updating the internal state of every game object over time. The state of a game object can be defined as the values of all its attributes. For example, the state of the ball in Pong is described by its (x, y) position on the screen and its velocity (speed and direction of travel). Because games are dynamic, time-based simulations, a game object’s state describes its configuration at one specific instant in time. In other words, a game object’s notion of time is discrete rather than continuous.

Most low-level engine subsystems (rendering, animation, collision, physics, audio and so on) require periodic updating, and the game object system is no exception. As we saw in The Game Loop, updating is usually done via a single master loop called the game loop. Virtually all game engines update game object states as part of their main game loop—in other words, they treat the game object model as just another engine subsystem that requires periodic servicing.

Game object updating can therefore be thought of as the process of determining the state of each object at the current time \(\boldsymbol{S}_i(t)\) given its state at a previous time \(\boldsymbol{S}_i(t − \Delta t)\). Once all object states have been updated, the current time t becomes the new previous time (\(t − \Delta t\)), and this process repeats for as long as the game is running. Usually, one or more clocks are maintained by the engine—one that tracks real time exactly and possibly others that may or may not correspond to real time. These clocks provide the engine with the absolute time t and/or with the change in time ∆t from iteration to iteration of the game loop. The clock that drives the updating of game object states is usually permitted to diverge from real time. This allows the behaviors of the game objects to be paused, slowed down, sped up or even run in reverse - whatever is required in order to suit the needs of the game design. These features are also invaluable for debugging and development of the game.

Note that game object updating exhibits aspects of discrete event simulation and agent-based simulation, topics which are covered in depth in the Master programme on Simulation.

The simplest way to update the states of a collection of game objects is to iterate over the collection and call a virtual function, named something like Update(), on each object in turn. This is typically done once during each iteration of the main game loop (i.e., once per frame). Game object classes can provide custom implementations of the Update() function in order to perform whatever tasks are required to advance the state of that type of object to the next discrete time index. The time delta from the previous frame can be passed to the update function so that objects can take proper account of the passage of time. At its simplest, then, our Update() function’s signature might look something like this:

virtual void Update(float dt);

A game object’s Update() function is primarily responsible for determining the state of that game object at the current discrete time index \(\boldsymbol{S}_i(t)\) given its previous state \(\boldsymbol{S}_i(t − \Delta t)\). Doing this may involve applying a rigid body dynamics simulation to the object, sampling a preauthored animation, reacting to events that have occurred during the current time step and so on.

Most game objects interact with one or more engine subsystems. They may need to animate, be rendered, emit particle effects, play audio, collide with other objects and static geometry and so on. Each of these systems has an internal state that must also be updated over time, usually once or a few times per frame. It might seem reasonable and intuitive to simply update all of these subsystems directly from within the game object’s Update() function. For example, consider the following hypothetical update function for a Tank object:

void Tank::Update(float dt)
{
    // Update the state of the tank itself.
    MoveTank(dt);
    DeflectTurret(dt);
    FireIfNecessary();
    
    // Now update low-level engine subsystems on behalf
    // of this tank. (NOT a good idea... see below!)
    m_pAnimationComponent->Update(dt);
    m_pCollisionComponent->Update(dt);
    m_pPhysicsComponent->Update(dt);
    m_pAudioComponent->Update(dt);
    m_pRenderingComponent->draw();
}

Within the main game loop the list of all game objects would then be iterated and Update called on each of them:

while (true)
{
  float dt = g_gameClock.CalculateDeltaTime();
  
  for (each gameObject)
  {
    // This hypothetical Update() function updates
    // all engine subsystems!
    gameObject.Update(dt);
  }
}

Unfortunately, this approach is not feasible for modern game engines due to their extremely stringent performance constraints. As a consequence, game engines generally perform batched updating: instead of updating each object’s subsystems interleaved with other unrelated operations, each engine subsystem is updated directly by the game loop rather than on a per-game-object basis from within the Update() function. This is also necessary for physics simulation, where rigid body simulation has to be performed on all objects at once instead of on a per-game-object basis.

This amounts to a decoupling of the what, when and how: a game object controls what is to be processed by a subsystem, while the subsystem controls when this is done and translates the what of the game object into how the subsystem delivers it. For example, a game object might define what is to be rendered by holding a reference to a triangle mesh, but the rendering system determines when this happens and how it is done (for example through command buffers). The hypothetical tank object then looks like this:

void Tank::Update(float dt)
{
    // Update the state of the tank itself.
    MoveTank(dt);
    DeflectTurret(dt);
    FireIfNecessary();
    
    // Control the properties of my various engine
    // subsystem components, but do NOT update
    // them here...
    if (justExploded)
    {
        m_pAnimationComponent->PlayAnimation("explode");
    }
    
    if (isVisible)
    {
        m_pCollisionComponent->Activate();
        m_pRenderingComponent->Show();
    }
    else
    {
        m_pCollisionComponent->Deactivate();
        m_pRenderingComponent->Hide();
    }
    
    // etc.
}

With the game loop then performing something like:

while(true) 
{
    float dt = g_gameClock.CalculateDeltaTime();
    
    for (each gameObject)
    {
        gameObject.Update(dt);
    }

    g_animationEngine.Update(dt);
    g_physicsEngine.Simulate(dt);
    g_collisionEngine.DetectAndResolveCollisions(dt);
    g_audioEngine.Update(dt);
    g_renderingEngine.RenderFrameAndSwapBuffers();
}

3.6.1 Dependencies

Sometimes game objects depend on one another. For example, if a human character is standing on a floating boat, the boat must be updated first and then the human character for correct positioning, imposing an order on the updates. Therefore, many game engines allow their game objects to run update logic at multiple points during the frame. For example, the Naughty Dog engine (the engine that drives the Uncharted and The Last of Us game series) updates game objects three times: once before animation blending, once after animation blending but prior to final pose generation, and once after final pose generation. In such a system, the game loop ends up looking something like this:

while (true) // main game loop
{
    // ...
    
    for (each gameObject)
    {
        gameObject.PreAnimUpdate(dt);
    }
    
    g_animationEngine.CalculateIntermediatePoses(dt);
    
    for (each gameObject)
    {
        gameObject.PostAnimUpdate(dt);
    }
    
    g_ragdollSystem.ApplySkeletonsToRagDolls();
    g_physicsEngine.Simulate(dt); // runs ragdolls too
    g_collisionEngine.DetectAndResolveCollisions(dt);
    g_ragdollSystem.ApplyRagDollsToSkeletons();
    g_animationEngine.FinalizePoseAndMatrixPalette();
    
    for (each gameObject)
    {
        gameObject.FinalUpdate(dt);
    }
    
    // ...
}

To make such a system more efficient, callbacks could be employed, where game objects register for the update phases they are interested in. This has the benefit that an update phase is not invoked for literally every game object, whether it needs it or not, but only for those that have registered for it.
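
A minimal sketch of such a registration mechanism is shown below. The phase names mirror the Naughty Dog example above, while the bus itself and its API are purely illustrative.

#include <functional>
#include <vector>

enum class UpdatePhase { PreAnim = 0, PostAnim = 1, Final = 2 };

class UpdatePhaseBus
{
    // One callback list per update phase.
    std::vector<std::function<void(float)>> m_callbacks[3];

public:
    // Game objects register only for the phases they actually need.
    void Subscribe(UpdatePhase phase, std::function<void(float)> callback)
    {
        m_callbacks[static_cast<int>(phase)].push_back(std::move(callback));
    }

    // The game loop runs a phase by invoking only the registered callbacks.
    void Run(UpdatePhase phase, float dt)
    {
        for (auto& callback : m_callbacks[static_cast<int>(phase)])
        {
            callback(dt);
        }
    }
};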

3.6.2 Inconsistencies

In theory, the states of all game objects are updated from time \(t_1\) to time \(t_2\) instantaneously and in parallel, as depicted in Figure 3.51.

Figure 3.51: In theory, the states of all game objects are updated instantaneously and in parallel during each iteration of the game loop (Figure taken from [13]).

However, this view is not accurate. For example, animation pose blending may already have run, while physics and collision resolution may not yet have been applied. This implies that if two game objects are asked for their current time during the update loop, they may not agree. This leads to the following rule: the states of all game objects are consistent before and after the update loop, but they may be inconsistent during it, see Figure 3.52.

Figure 3.52: In practice, the states of the game objects are updated one by one. This means that at some arbitrary moment during the update loop, some objects will think the current time is \(t_2\) while others think it is still \(t_1\). Some objects may be only partially updated, so their states will be internally inconsistent. In effect, the state of such an object lies at a point between \(t_1\) and \(t_2\) (Figure taken from [13]).

This can lead to nasty bugs. For example, if object B looks at the velocity of object A in order to determine its own velocity at time t, then the programmer must be clear about whether to read the previous state of object A, \(\boldsymbol{S}_A(t_1)\), or the new state, \(\boldsymbol{S}_A(t_2)\). If the new state is needed but object A has not yet been updated, then we have an update order problem that can lead to a class of bugs known as one-frame-off lags. In this type of bug, the state of one object lags one frame behind the states of its peers, which manifests itself on-screen as a lack of synchronization between game objects.

A solution to this problem is for each game object to cache its previous state vector \(\boldsymbol{S}_i(t_1)\) while calculating its new state vector \(\boldsymbol{S}_i(t_2)\), rather than overwriting the state in place during the update. At the end of the update for the current frame, the changes are committed, overwriting the old states; this is safe because no object is reading the states anymore at that point. While this approach solves the consistency issues, it also requires twice as much memory for game object state.
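
A minimal sketch of this state double-buffering; State is an assumed stand-in for the full attribute set \(\boldsymbol{S}_i\) of a game object.

struct State
{
    float position[3];
    float velocity[3];
    // ... all other attributes of the object ...
};

class BufferedGameObject
{
    State m_previous; // S_i(t1): consistent, safe for others to read
    State m_current;  // S_i(t2): written to during the update loop

public:
    const State& PreviousState() const { return m_previous; }

    void Update(float dt)
    {
        // Compute m_current from m_previous (and from other objects'
        // previous states) without modifying m_previous.
        (void)dt;
    }

    // Called once per frame, after ALL objects have been updated.
    void Commit() { m_previous = m_current; }
};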

3.7 Concurrency in Game Object Updates

In this section, we’ll take a look at ways in which concurrency and parallelism can be applied to the problem of updating the states of our game objects. Whether our game object model is updated in a single thread or across multiple cores, it needs to be able to interface with the low-level engine systems, which in modern game engines almost always run concurrently in their own threads. When employing a job system as discussed in Multicore Game Loops, it is easy to perform the work of the subsystems in jobs, which are scheduled on user-level threads, fibers or OS threads. However, when making our low-level engine systems update concurrently, we need to ensure that their interfaces are thread-safe:

  • When the subsystems are running in user-level threads (coroutines or fibers), we need to use spin locks to make our subsystems thread-safe, since blocking in the OS would stall the worker thread together with all fibers scheduled on it; a minimal spin lock sketch follows this list.
  • When using OS threads, mutexes are a viable option for this purpose.
  • Sometimes lock-free mechanisms or read-write locks are an option.
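
A minimal test-and-set spin lock sketch, as might be used in a fiber-based job system where workers must not block in the OS:

#include <atomic>

class SpinLock
{
    std::atomic_flag m_flag = ATOMIC_FLAG_INIT;

public:
    void Lock()
    {
        // Busy-wait until the flag is acquired; we never call into the
        // OS to block, so the fiber keeps its worker thread.
        while (m_flag.test_and_set(std::memory_order_acquire))
        {
            // spin
        }
    }

    void Unlock()
    {
        m_flag.clear(std::memory_order_release);
    }
};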

When employing a job system, we need to translate synchronous calls to subsystems into asynchronous calls using the job system. For example, a game might request that a ray be cast into the world in order to determine whether the player has line-of-sight to an enemy character. In a synchronous design, the ray cast would be done immediately in response to the request, and when the ray casting function returned, the results would be available:

SomeGameObject::Update()
{
    // ...

    // Cast a ray to see if the player has line of sight
    // to the enemy.
    RayCastResult r = castRay(playerPos, enemyPos);
    
    // Now process the results...
    if (r.hitSomething() && isEnemy(r.getHitObject()))
    {
        // Player can see the enemy.
        // ...
    }
    
    // ...
}

Translating this into an asynchronous call using a job system, we would do something like:

SomeGameObject::Update()
{
    // ...
    
    // Cast a ray to see if the player has line of sight
    // to the enemy.
    RayCastResult r;
    requestRayCast(playerPos, enemyPos, &r);
    
    // Do other unrelated work while we wait for the
    // other CPU to perform the ray cast for us.
    
    // ...
    
    // OK, we can't do any more useful work. Wait for the
    // results of our ray cast job. If the job is
    // complete, this function will return immediately.
    // Otherwise, the main thread will idle until the
    // results are ready...
    waitForRayCastResults(&r);
    
    // Process results...
    if (r.hitSomething() && isEnemy(r.getHitObject()))
    {
        // Player can see the enemy.
        // ...
    }
    
    // ...
}

In many instances, asynchronous code can kick off a request on one frame, and pick up the results on the next:

RayCastResult r;
bool rayJobPending = false;

SomeGameObject::Update()
{
    // ...
    
    // Wait for the results of last frame's ray cast job.
    if (rayJobPending)
    {
        waitForRayCastResults(&r);
        // Process results...
        if (r.hitSomething() && isEnemy(r.getHitObject()))
        {
            // Player can see the enemy.
            // ...
        }
    }
    
    // Cast a new ray for next frame.
    rayJobPending = true;
    requestRayCast(playerPos, enemyPos, &r);
    
    // Do other work...
    // ...
}

Unfortunately, game object models are notoriously difficult to parallelize for a few reasons. Game objects tend to be highly interdependent, and are typically dependent on numerous engine subsystems as well. Inter-object dependencies arise because game objects routinely communicate with one another and query one another’s states during their updates. These interactions tend to occur multiple times during the update loop, and the pattern of communication can be unpredictable and highly sensitive to the inputs of the human player and the events that are occurring in the game world. This makes it difficult to process game object updates concurrently (using multiple threads or multiple CPU cores). The main challenge is to deal with the degree of parallelism, as can be seen in Figure 3.53.

Figure 3.53: Three job dependency trees. The nodes of the tree are jobs, and the arrows represent dependencies between them. The number of leaves in such a tree indicates the system’s degree of parallelism (DOP) (Figure taken from [13]).

The main approach is to be clever about synchronisation points, as can be seen in Figure 3.54.

Figure 3.54: A synchronization point is introduced whenever one job is dependent upon the completion of one or more other jobs. Here, job D depends on job C, and job F depends on jobs A, B, D and E (Figure taken from [13]).

In the following, we briefly explain how Naughty Dog implemented their concurrent game object model, which is built on the job system introduced in Multicore Game Loops.

3.7.1 Naughty Dog’s Concurrent Game Object Model

The idea in Naughty Dog’s concurrent game object model is to use the job system not only for parallelising various low-level subsystems (such as animation, audio, physics and so on) but for making game object updates jobs too. Making this work correctly and efficiently is non-trivial, however. They employed a number of techniques to deal with these issues:

  • When it comes to object dependencies, they use a bucket system. In such a system, objects of bucket \(B\) depend only on objects in buckets 0 to \(B - 1\). This allows updating the game objects of a bucket by kicking them off as jobs and letting the job system schedule them across the available cores:
void UpdateBucket(int iBucket)
{
    job::Declaration aDecl[kMaxObjectsPerBucket];
    const int count = GetNumObjectsInBucket(iBucket);
    
    for (int jObject = 0; jObject < count; ++jObject)
    {
        job::Declaration& decl = aDecl[jObject];
        decl.m_pEntryPoint = UpdateGameObjectJob;
        decl.m_param = reinterpret_cast<uintptr_t>(
            GetGameObjectInBucket(iBucket, jObject));
    }
    
    job::KickJobsAndWait(aDecl, count);
}
  • The fibre-based job system allows kicking off asynchronous requests to subsystems and other objects. The subsystem processes the request at some future time during the frame; later in the frame, or in the next frame, the game object picks up the results and acts on them.

  • With the bucket system, inter-bucket querying is safe: objects in bucket \(B\) can safely query objects in other buckets. However, querying objects of the same bucket needs more care, as they are processed in parallel in individual jobs, and accessing them could cause data races or inconsistencies. Naughty Dog’s solution was to use object snapshots, as explained above in Inconsistencies. At the start of each bucket update, each game object updates its snapshot. As these snapshot updates never query the state of other game objects, they can be run concurrently without locks. Then the jobs updating the game objects’ internal states are kicked off, also running concurrently; objects can now safely query each other’s snapshots, even within the same bucket.

  • Sometimes game objects need to mutate each other. With the system described so far this is not an issue between buckets, but it can be tricky for objects in the same bucket. Naughty Dog solved it with locks and request queues: the request queue is protected by a lock, and the handling of the requests in the queue is deferred until after the bucket update has completed.

3.8 Rendering Architecture

In this section we take a very brief look at how a rendering architecture fits into a modern game engine with a concurrent job and game object model system. We do not discuss individual computer graphics topics such as culling, shading, particles or global illumination, but rather how rendering fits into the overall architecture and how the data is separated from the actual rendering: separating domain/application from presentation.

An issue with a concurrent job and game object system is that parallel processing often does not map well to hardware constraints. This has consequences for the rendering part of the engine, as ultimately the access to the GPU is sequential. Therefore, before the rise of multicore CPUs in the mass market, graphics APIs such as DirectX and OpenGL allowed only one thread to access the graphics driver at a time, making parallel processing for the actual rendering stage quite difficult.

In general, there are two operations in a graphics driver that can potentially use multiple processors: resource creation and render-related calls. Creating resources such as textures and buffers can be purely CPU-side operations and so are naturally parallelizable. That said, creation and deletion can also be blocking tasks, as they might trigger operations on the GPU or need a particular device context. In any case, older APIs were created before consumer-level multiprocessing CPUs existed, so needed to be rewritten to support such concurrency.

The key construct modern multicore game engines use is the so-called command buffer or command list. A command buffer (CB) is a list of API state change and rendering calls. Such lists can be created, stored and replayed as desired. They may also be combined to form longer command buffers. Only a single CPU processor communicates with the GPU via the driver and so can send it a CB for execution. However, every processor (including this single processor) can create or concatenate stored command buffers in parallel.

In DirectX 11, for example, the processor that communicates with the driver sends its render calls to what is called the immediate context. The other processors each use a deferred context to generate command buffers. As the name implies, these are not directly sent to the driver. Instead, these are sent to the immediate context for rendering, see Figure 3.55.

Figure 3.55: Command buffers. Each processor uses its deferred context, shown in orange, to create and populate one or more command buffers, shown in blue. Each command buffer is sent to Process #1, which executes these as desired, using its immediate context, shown in green. Process #1 can do other operations while waiting for command buffer N from Process #3. (Figure taken from [2]).

Alternately, a command buffer can be sent to another deferred context, which inserts it into its own CB. Beyond sending a command buffer to the driver for execution, the main operations that the immediate context can perform that the deferred cannot are GPU queries and readbacks. Otherwise, command buffer management looks the same from either type of context.
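
As a concrete illustration, here is a minimal sketch of recording a command list on a DirectX 11 deferred context and replaying it on the immediate context. The device and immediate context are assumed to have been created elsewhere, and all error handling is omitted.

#include <d3d11.h>

void RecordAndReplay(ID3D11Device* pDevice,
                     ID3D11DeviceContext* pImmediateContext)
{
    // Each worker thread creates its own deferred context...
    ID3D11DeviceContext* pDeferredContext = nullptr;
    pDevice->CreateDeferredContext(0, &pDeferredContext);

    // ...and records state changes and draw calls into it, e.g.:
    // pDeferredContext->IASetInputLayout(...);
    // pDeferredContext->Draw(...);

    // Recording ends by baking the calls into a command list.
    ID3D11CommandList* pCommandList = nullptr;
    pDeferredContext->FinishCommandList(FALSE, &pCommandList);

    // Only the thread owning the immediate context talks to the
    // driver; it replays the recorded command list.
    pImmediateContext->ExecuteCommandList(pCommandList, FALSE);

    pCommandList->Release();
    pDeferredContext->Release();
}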

An advantage of command buffers is that they can be stored and replayed. Command buffers are not fully bound when created, which aids in their reuse. For example, say a CB contains a view matrix. The camera moves, so the view matrix changes. However, the view matrix is stored in a constant buffer. The constant buffer’s contents are not stored in the CB, only the reference to them. The contents of the constant buffer can be changed without having to rebuild the CB. Determining how best to maximize parallelism involves choosing a suitable granularity—per view, per object, per material—to create, store, and combine command buffers.

Such multithreading draw systems existed for years before command buffers were made a part of modern APIs. API support makes the process simpler and lets more tools work with the system created. However, command lists do have creation and memory costs associated with them. Also, the expense of mapping an API’s state settings to the underlying GPU is still a costly operation with DirectX 11 and OpenGL. Within these systems command buffers can help when the application is the bottleneck, but can be detrimental when the driver is.

Certain semantics in these earlier APIs did not allow the driver to parallelize various operations, which helped motivate the development of Vulkan, DirectX 12, and Metal. A thin draw submission interface that maps well to modern GPUs minimizes the driver costs of these newer APIs. Command buffer management, memory allocation, and synchronization decisions become the responsibility of the application instead of the driver. In addition, command buffers with these newer APIs are validated once when formed, so repeated playback has less overhead than those used with earlier APIs such as DirectX 11. All these elements combine to improve API efficiency, allow multiprocessing, and lessen the chances that the driver is the bottleneck.

References

[2] Tomas Akenine-Möller, Eric Haines, and Naty Hoffman. 2019. Real-Time Rendering. CRC Press.

[9] David Eberly. 2006. 3D Game Engine Design: A Practical Approach to Real-Time Computer Graphics. CRC Press.

[13] Jason Gregory. 2018. Game Engine Architecture. CRC Press.

[22] Mike McShaffry. 2014. Game Coding Complete. Nelson Education.


  1. https://github.com/id-Software/DOOM↩︎

  2. https://github.com/id-Software/Quake↩︎

  3. https://github.com/id-Software/Quake-2↩︎

  4. Note that modern GPUs of 2020 are capable of rendering 20 million polygons per frame. Assuming 60 FPS, the difference amounts to a mind-boggling increase of geometric complexity of a factor of up to 120!↩︎

  5. https://github.com/id-Software/Quake-III-Arena↩︎

  6. https://en.wikipedia.org/wiki/Fast_inverse_square_root↩︎

  7. https://github.com/id-Software/DOOM-3↩︎

  8. https://github.com/id-Software/DOOM-3-BFG↩︎

  9. The other is GUI programming, and in fact, the origins of object-oriented programming go back to computer simulation and GUI programming. We think that in both of these domains object-oriented programming is by far superior to other paradigms such as functional programming.↩︎