Need help to optimize Draw cost

Patapits · May 13, 2019, 2:33pm

This topic is somewhat related to the other one I created earlier.

Currently, my 2D game scene is composed of several objects (around 500), each having its own geometry (from 4 to 1000 vertices). I have a total of 30000 vertices in my game world. These objects can change in position, size, color, and depth. There is no lighting, so I’m only using BasicEffect with VertexPositionColor (or possibly VertexPositionColorTexture if needed).

The basic question is: how can I efficiently draw these vertices?

Currently, I have one big DynamicVertexBuffer and one big DynamicIndexBuffer for the whole game, with each geometry using a defined part of them. When one of an object’s parameters changes, I update its vertices and write them to the big buffer using SetData with SetDataOptions.NoOverwrite.
I also have only one BasicEffect for the whole game.
When Draw() is called on an object, I store that call in a DrawBatch. When all draw calls have been batched, I sort them by object depth in ascending order (I need alpha blending), and then I call an ApplyDraw method on each object, something similar to this code:

// Bind the shared buffers; each object occupies a sub-range of them.
Device.SetVertexBuffer(commonVertexBuffer);
Device.Indices = commonIndexBuffer;

foreach (EffectPass pass in commonEffect.CurrentTechnique.Passes)
{
    pass.Apply();
    // Arguments: primitive type, base vertex, start index, primitive count.
    Device.DrawIndexedPrimitives(PrimitiveType.TriangleList, objectVertexBufferOffset, objectIndexBufferOffset,
        objectIndexCount / 3);
}

Why am I not satisfied? Because:

I can’t find a proper way to allocate part of a DynamicVertexBuffer to a specific object without side effects (mad flickering when using SetDataOptions.Discard, and some weird behavior, as described in the other topic, when using SetDataOptions.NoOverwrite).
My game sometimes lags at 60 fps on Windows and lags badly on Android. I did some CPU profiling, and it’s pretty clear that whatever is eating time on each tick is drawing-related. Moreover, on Windows, the Task Manager shows around 50% GPU usage for the game while it’s running, which is probably a sign that something is wrong with the way I’m drawing everything.

What I already tried:

One BasicEffect per object: obviously, performance was really bad.
A common BasicEffect for all objects, but one VertexBuffer/IndexBuffer per object: performance was not great either.

Thanks for your help!

markus · May 13, 2019, 7:32pm

Are you calling SetData for every object every frame? That sounds like a bad idea. If possible you should use static meshes, and use shader parameters to change properties like position and color.

Your vertex count is low, so it’s really just the number of draw calls you need to be concerned about. Putting multiple objects into one vertex buffer is a good strategy for reducing draw calls, but you don’t want to change the vertex buffer constantly. Set it up once in the beginning, and don’t change it afterwards. Add an index to your vertex data, so every vertex knows which object it belongs to. In the vertex shader you can then use this index to get per-object data from arrays.
Unfortunately that means you have to use a custom shader.
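
To make the "one static buffer plus a per-vertex object index" idea more concrete, here is a minimal CPU-side sketch in plain C#. The tuple layout and the MergeMeshes name are made up for illustration; in a MonoGame project the merged vertices would go into a custom vertex struct with its own VertexDeclaration, which is not shown here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical input: each mesh has 2D positions and triangle-list indices
// that are local to that mesh (starting at 0).
var meshes = new List<((float X, float Y)[] Positions, int[] Indices)>
{
    (new[] { (0f, 0f), (1f, 0f), (1f, 1f), (0f, 1f) }, new[] { 0, 1, 2, 0, 2, 3 }), // quad
    (new[] { (5f, 5f), (6f, 5f), (5f, 6f) }, new[] { 0, 1, 2 }),                    // triangle
};

// Merges all meshes into one vertex list and one index list.
// Every vertex is tagged with the index of the object it belongs to,
// and local indices are shifted so they point into the merged list.
(List<(float X, float Y, int ObjectIndex)> Vertices, List<int> Indices) MergeMeshes(
    IReadOnlyList<((float X, float Y)[] Positions, int[] Indices)> source)
{
    var vertices = new List<(float X, float Y, int ObjectIndex)>();
    var indices = new List<int>();
    for (int obj = 0; obj < source.Count; obj++)
    {
        int baseVertex = vertices.Count; // where this object's vertices start
        foreach (var p in source[obj].Positions)
            vertices.Add((p.X, p.Y, obj));
        foreach (int i in source[obj].Indices)
            indices.Add(baseVertex + i);
    }
    return (vertices, indices);
}

var merged = MergeMeshes(meshes);
Console.WriteLine($"{merged.Vertices.Count} vertices, {merged.Indices.Count} indices");
```

Because the indices are shifted on the CPU, the whole merged buffer can be drawn with a base vertex of 0 in a single call, and the shader can use ObjectIndex to look up per-object data.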

Patapits · May 13, 2019, 7:58pm

Thanks for your input on my case.
I’m using SetData only when it’s needed, that is, only when a property has changed on an object (I implemented INotifyPropertyChanged to help me do so). So when lots of objects are moving in my scene, it has some impact on CPU usage.
What I was doing before was having a static mesh and setting the BasicEffect World and DiffuseColor properties for each object. But it performed worse than the current version, probably because lots of state changes happened on the BasicEffect (setting World and DiffuseColor for each object on every Draw call)?
It’s not a problem to create a custom shader, but it will take me some time (since I’ve never written one). I’m just wondering: is it really faster to update a buffer with all of the objects’ properties (like position and color) and pass it to the shader, rather than to set World and DiffuseColor (for instance) on a BasicEffect? I’m really not a GPU expert, but to me it’s more or less the same amount of data being sent to the GPU, so I can’t really “feel” why it would greatly increase my game’s performance.

markus · May 13, 2019, 8:42pm

You’re right, the amount of data sent to the GPU is about the same in both cases, but that’s not really the bottleneck in this case. Even for 500 objects with a couple of parameters per object, that still amounts to very little data-wise. The real bottleneck is the overhead you get from having many separate draw calls, especially if there are many state changes between them.

willmotil · May 14, 2019, 3:24am

I keep this around as it’s useful and I post it pretty often; maybe it will help you get started with your own shader. There are two examples: one with SpriteBatch that gives it a pixel shader, and one without that mimics the vertex shader setup.

Patapits · May 14, 2019, 7:05am

Thank you both, I’ll try to set up a shader to see if it fits my performance needs!

Patapits · May 14, 2019, 12:23pm

I now have a working shader, but just a quick question: how would you pass object parameters (like position and color) to the shader? I’m using arrays, but they seem to have a size limitation, as stated here:

Each shader stage allows up to 15 shader-constant buffers; each buffer can hold up to 4096 constants.

What can I do if I have lots of instances?

Here is what my current shader looks like:

#define MAXINSTANCES 100

float4x4 Worlds[MAXINSTANCES];
float4 Colors[MAXINSTANCES];

float4x4 View;
float4x4 Projection;

struct VertexShaderInput
{
    float2 Position : POSITION;
    float4 Color    : COLOR;
    uint Index      : TEXCOORD0;
};

struct VertexShaderOutput
{
    float4 Position : POSITION;
    float4 Color    : COLOR;
};

VertexShaderOutput VertexShaderFunction(VertexShaderInput input)
{
    VertexShaderOutput output;

    float4 position = float4(input.Position,0,1);

    float4 worldPosition = mul(position, Worlds[input.Index]);
    float4 viewPosition = mul(worldPosition, View);
    output.Position = mul(viewPosition, Projection);
    output.Color = input.Color * Colors[input.Index];

    return output;
}

float4 PixelShaderFunction(VertexShaderOutput input) : COLOR0
{
    return input.Color;
}

technique Custom
{
    pass Pass
    {
        VertexShader = compile vs_4_0 VertexShaderFunction();
        PixelShader = compile ps_4_0 PixelShaderFunction();
    }
}

If I try to set MAXINSTANCES to 1000, for example, I get the following error at compile time:

error X8000: D3D11 Internal Compiler Error: Invalid Bytecode: Index Dimension 2 out of range (5008 specified, max allowed is 4096) for operand #1 of opcode #1 (counts are 1-based). Aborting.

markus · May 14, 2019, 1:07pm

With MAXINSTANCES set to 1000 you’re running into the upper limit of 4096 constants.
It’s quite easy to understand where the 5008 in the error message comes from. First of all, the 4096 constants are float4s. Your world matrices are float4x4, which means each matrix uses 4 float4 constants. One more float4 is used per instance for the color, so you need 5 constants in total per instance. For 1000 instances that would be 5000 constants, plus 8 constants for your view and projection matrices. As it stands, MAXINSTANCES = (4096 - 8) / 5 = 817 (rounded down).

If you have more instances than you can fit in your shader, just start another batch. If you can do a couple hundred instances in one draw call you’re good.
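
As a rough sketch of what "start another batch" could look like on the CPU side, the snippet below splits a number of instances into consecutive ranges that each fit the shader's array size. SplitIntoBatches is a hypothetical helper, and the sketch assumes that the objects of one batch are contiguous in the buffers and that the per-vertex index is stored relative to the start of its batch.

```csharp
using System;
using System.Collections.Generic;

// Splits instanceCount instances into consecutive batches of at most
// maxPerBatch. Each (Start, Count) pair would map to one draw call, with
// the matching slice of world matrices and colors uploaded beforehand.
List<(int Start, int Count)> SplitIntoBatches(int instanceCount, int maxPerBatch)
{
    var batches = new List<(int Start, int Count)>();
    for (int start = 0; start < instanceCount; start += maxPerBatch)
        batches.Add((start, Math.Min(maxPerBatch, instanceCount - start)));
    return batches;
}

foreach (var (first, size) in SplitIntoBatches(500, 200))
    Console.WriteLine($"batch: instances {first}..{first + size - 1}");
```

With 500 objects and room for 200 per batch, this gives three draw calls instead of 500.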

You can probably save a little by using 4x3 world matrices; you generally only need 4x4 for projection matrices.
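
The budget arithmetic above can be checked with a few lines of C#. This is only a sketch: the 4096 and 8 come from the discussion above, while the figure for 4x3 matrices assumes that a float4x3 occupies three float4 registers (the default column-major packing), which is worth verifying against the compiled shader.

```csharp
using System;

// Constant budget of one constant buffer, counted in float4 registers.
const int MaxConstants = 4096;
const int SharedConstants = 8; // View + Projection: two float4x4 = 2 * 4 registers

// Largest instance count that fits, given the registers each instance needs.
int MaxInstances(int registersPerInstance) =>
    (MaxConstants - SharedConstants) / registersPerInstance; // integer division rounds down

int with4x4 = MaxInstances(4 + 1); // float4x4 world + float4 color
int with4x3 = MaxInstances(3 + 1); // float4x3 world (assumed 3 registers) + float4 color

Console.WriteLine($"float4x4 worlds: {with4x4} instances");
Console.WriteLine($"float4x3 worlds: {with4x3} instances");
```

Under these assumptions the limit goes from 817 to 1022 instances per draw call.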

Patapits · May 14, 2019, 1:21pm

Yeah, this is also the conclusion I came to. So is there no way to store these values in a buffer (something similar to a VertexBuffer) and access them from the shader? I’ve also seen some people using a Texture2D to store these kinds of values without caring about that limit, but I don’t know if it’s worth the effort…

markus · May 14, 2019, 1:35pm

There’s hardware instancing, where you use a second vertex buffer that holds all of the instance data. It only works for instances that use the same mesh, but that’s also an advantage, because you don’t have to build this giant vertex buffer. You mentioned Android above, and I don’t think hardware instancing is supported there.
And yes, you could also use textures to store the instance data, but again I’m not sure vertex texture fetch works on Android, it might. I would also guess that it might be a bit slower than using constants.

Patapits · May 14, 2019, 1:45pm

I’ve also looked into hardware instancing, but it seems to be a bit too “specific” to fit my needs. I have lots of different geometries, so I guess using it would result in lots of draw calls (one for each kind of geometry).
I’m working on getting the above shader fully functional, and I’m wondering: how can I handle sorting my game objects by depth? All the vertices are already in the giant VertexBuffer, so I can’t reorder them. Or maybe I could recreate an ordered buffer on each draw call? But I think that would lead to performance issues again…
Anyway, thanks again for your answers.

markus · May 14, 2019, 1:59pm

If your objects were all the same you could just sort the array, but since your objects all have different geometries that’s indeed a problem. It’s for transparency, right? I would look into techniques where you don’t need to render in sorted order; see order-independent transparency.

Patapits · May 14, 2019, 3:58pm

Yes, it’s for transparency. I came across some topics about order-independent transparency back when I understood that alpha blending isn’t magic. I wanted to find a more general and flexible way to implement it, because I found ordering too restrictive (why spend that much time ordering things just for transparency?), but I didn’t find any easy solution. So I will try to dig deeper into it!
EDIT: well, I remember now why I gave up on order-independent transparency: it’s way too complicated for my case (a simple 2D game, without any complex intersections between semi-transparent objects). So I need to find a proper way to sort my game objects before rendering without losing much performance, while using the kind of approach you pointed me towards! I guess it’s not going to be easy, once again.

markus · May 14, 2019, 5:07pm

I don’t know what your game looks like, but are there really that many half-transparent objects? For full transparency you could still use the depth buffer to take care of depth sorting.

Can’t you get away with rendering all solid objects first, and all transparent objects afterwards? If there are just a few transparent objects, sorting them shouldn’t be a big problem.
And even if you don’t sort them, there’s only a problem if you have transparent objects behind transparent objects. You can still use the depth buffer to make sure that more distant objects don’t draw on top of closer objects. The only problem left is that you have a 50% chance of a more distant object disappearing behind a closer object. Whether that’s acceptable or not very much depends on your game world.
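
A minimal sketch of the "solid first, transparent afterwards" ordering in plain C#. The item fields and the BuildDrawOrder name are hypothetical, and the sketch assumes that a smaller Depth means further back, matching the ascending sort described in the first post.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical draw items; field names are illustrative only.
var items = new List<(string Name, float Depth, float Alpha)>
{
    ("wall", 2f, 1f), ("glass", 3f, 0.5f), ("smoke", 1f, 0.3f), ("floor", 0f, 1f),
};

// Returns the items in draw order: opaque ones first (in any order, since
// the depth buffer resolves them), then semi-transparent ones sorted back
// to front so that alpha blending composites correctly.
List<(string Name, float Depth, float Alpha)> BuildDrawOrder(
    List<(string Name, float Depth, float Alpha)> source)
{
    var opaque = new List<(string Name, float Depth, float Alpha)>();
    var transparent = new List<(string Name, float Depth, float Alpha)>();
    foreach (var item in source)
    {
        if (item.Alpha >= 1f) opaque.Add(item);
        else transparent.Add(item);
    }
    transparent.Sort((a, b) => a.Depth.CompareTo(b.Depth)); // smallest depth first
    opaque.AddRange(transparent);
    return opaque;
}

foreach (var drawn in BuildDrawOrder(items))
    Console.WriteLine($"{drawn.Name} (depth {drawn.Depth}, alpha {drawn.Alpha})");
```

Only the semi-transparent subset is sorted, so the cost of sorting scales with the number of blended objects rather than with the whole scene.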

Patapits · May 14, 2019, 7:06pm

The problem is that many objects change in transparency during the game (typically alpha going linearly from 0 to 1 or the reverse). So I think that maintaining two buffers (one unsorted, or sorted front-to-back, for the solid objects, and one sorted back-to-front for the semi-transparent ones) would have a significant impact on performance, besides making the above solution (custom shader with static vertex buffer) unsuitable.
Also, there is no concept of a “distant” object in my game: the camera has a basic orthographic projection, and the depth (z component) is used to define layers so that the different game objects display properly.

willmotil · May 14, 2019, 9:51pm

Vertex texture fetch doesn’t work on desktop, last I heard; I’d be surprised if it worked on Android.

Instancing is the only thing I can think of to solve this sort of problem. 500 draws, when each is a different geometry with a ton of vertices, is a lot. Even then I’m not sure how you can make that work.
I think the transparency is far less of a problem than that is.

Patapits · May 14, 2019, 10:05pm

Did you mean on mobile? On desktop it works with shader models vs_4_0 and ps_4_0 (I just got it working; it seems relatively fast, but I need to do more testing tomorrow).

willmotil · May 14, 2019, 10:12pm

Well, they must have just added it recently; I think I tested it on GL a couple of months ago and it didn’t work.
Instancing with VTF is about as fast as you can get, I think.
Anyway, I think you would need to provide more in-depth info to get much more help; it seems that what you have is fundamentally a complicated structural problem.

markus · May 14, 2019, 11:49pm

If you can draw all your objects in a single or a few draw calls, and it’s just 30K vertices in total, you could even consider this method for depth sorting:

Draw your scene in multiple depth slices. For each slice draw your entire vertex buffer containing all the objects, but set a near and a far clipping plane to only draw stuff in a certain depth range. That way you can build up your scene in layers back to front.
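
A small sketch of how the slice ranges could be generated, in plain C#. DepthSlicesBackToFront is a hypothetical helper, and the sketch assumes that a larger depth value is further from the camera, as with the near and far planes of a standard projection.

```csharp
using System;
using System.Collections.Generic;

// Splits the depth range [near, far] into equal slices and yields them
// farthest first, so drawing slice by slice builds the scene back to front.
IEnumerable<(float Near, float Far)> DepthSlicesBackToFront(float near, float far, int sliceCount)
{
    float step = (far - near) / sliceCount;
    for (int i = sliceCount - 1; i >= 0; i--)
        yield return (near + i * step, near + (i + 1) * step);
}

foreach (var (sliceNear, sliceFar) in DepthSlicesBackToFront(0f, 10f, 10))
{
    // In a MonoGame project, each pair would become the zNearPlane and
    // zFarPlane of the projection (for example via
    // Matrix.CreateOrthographicOffCenter) before drawing the whole buffer again.
    Console.WriteLine($"draw slice {sliceNear}..{sliceFar}");
}
```

If objects sit on discrete layers, it is probably safest to place the slice boundaries between layers, so that no object lies exactly on a clipping plane.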

Yes, that sounds a bit wasteful, and it is a bit wasteful, but with say 10 slices you are still only drawing 300K vertices per frame, which is still very little, at least on desktop.

Just an idea. If your vertex count or your number of draw calls gets bigger, this method becomes less and less viable.

LithiumToast · May 15, 2019, 1:54am

If you have not read it yet: http://realtimecollisiondetection.net/blog/?p=86

To mimic the behaviour in the blog with C#:

Use an int for a sorting key. It’s up to you to define the bits, but you could use BitVector32 to help you. I recommend you start with a 32-bit integer, as it’s usually the word size.
Create three arrays: one holds the sorting keys, another the vertices, and the last the indices.
Fill all three arrays as buffers for your drawing calls.
When you need to flush, use Array.Sort(Array, Array) to sort the indices by the sorting key array. Then upload the sorted indices to the GPU as an index buffer. The vertex buffer can be uploaded at the same time. But don’t sort the vertices; it’s going to be more CPU-intensive to swap larger structs than 16-bit or 32-bit integers.
After uploading the geometry to the GPU, draw it using a single draw call: GraphicsDevice.DrawIndexedPrimitives.
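
The steps above can be sketched in plain C# as follows. Note that this is a variation rather than a literal transcription: it sorts one key per object and then copies each object's index range, instead of sorting individual indices, because Array.Sort is not a stable sort and could otherwise shuffle the indices of triangles that share a key. All array names and the sample data are hypothetical.

```csharp
using System;

// Hypothetical per-object data: a sort key and the object's range in the
// unsorted index array. Smaller keys are drawn first.
int[] depthKeys = { 30, 10, 20 };
int[] firstIndex = { 0, 6, 9 };
int[] indexCount = { 6, 3, 6 };
short[] indices = { 0, 1, 2, 0, 2, 3, 4, 5, 6, 7, 8, 9, 7, 9, 10 };

// Builds a new index array in which the objects' index ranges appear in
// ascending key order. The vertices themselves are never moved.
short[] BuildSortedIndices(int[] keys, int[] first, int[] count, short[] source)
{
    int[] sortedKeys = (int[])keys.Clone(); // keep the caller's keys intact
    int[] order = new int[keys.Length];
    for (int i = 0; i < order.Length; i++) order[i] = i;
    Array.Sort(sortedKeys, order); // reorders 'order' along with the keys

    short[] result = new short[source.Length];
    int write = 0;
    foreach (int obj in order)
    {
        Array.Copy(source, first[obj], result, write, count[obj]);
        write += count[obj];
    }
    return result;
}

short[] sorted = BuildSortedIndices(depthKeys, firstIndex, indexCount, indices);
Console.WriteLine(string.Join(",", sorted));
```

The sorted array is what would be uploaded to the index buffer before the single DrawIndexedPrimitives call. Objects with equal keys may come out in either order.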