Need help optimizing Draw cost

Thank you both, I’ll try to set up a shader to see if it fits my performance needs!

I now have a working shader, but just a quick question: how would you have passed object parameters (like position and color) to the shader? I’m using arrays, but there seems to be a size limitation, as stated here:

Each shader stage allows up to 15 shader-constant buffers; each buffer can hold up to 4096 constants.

What can I do if I have lots of instances?

Here’s what my current shader looks like:

#define MAXINSTANCES 100

float4x4 Worlds[MAXINSTANCES];
float4 Colors[MAXINSTANCES];

float4x4 View;
float4x4 Projection;

struct VertexShaderInput
{
    float2 Position : POSITION;
    float4 Color    : COLOR;
    uint Index      : TEXCOORD0;
};

struct VertexShaderOutput
{
    float4 Position : POSITION;
    float4 Color    : COLOR;
};

VertexShaderOutput VertexShaderFunction(VertexShaderInput input)
{
    VertexShaderOutput output;

    float4 position = float4(input.Position,0,1);

    float4 worldPosition = mul(position, Worlds[input.Index]);
    float4 viewPosition = mul(worldPosition, View);
    output.Position = mul(viewPosition, Projection);
    output.Color = input.Color * Colors[input.Index];

    return output;
}

float4 PixelShaderFunction(VertexShaderOutput input) : COLOR0
{
    return input.Color;
}

technique Custom
{
    pass Pass
    {
        VertexShader = compile vs_4_0 VertexShaderFunction();
        PixelShader = compile ps_4_0 PixelShaderFunction();
    }
}

If I try to set MAXINSTANCES to 1000, for example, I get the following error at compile time:

error X8000: D3D11 Internal Compiler Error: Invalid Bytecode: Index Dimension 2 out of range (5008 specified, max allowed is 4096) for operand #1 of opcode #1 (counts are 1-based). Aborting.

With MAXINSTANCES set to 1000 you’re running into the upper limit of 4096 constants.
It’s quite easy to see where the 5008 in the error message comes from. First of all, the 4096 constants are float4s. Your world matrices are float4x4, so each matrix uses 4 float4 constants. One more float4 is used per instance for the color, which makes 5 constants per instance in total. For 1000 instances that’s 5000 constants, plus 8 constants for your view and projection matrices: 5008. As it stands, MAXINSTANCES = (4096 - 8) / 5 = 817.
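That arithmetic can be sanity-checked with a few lines of C# (this is just the math from this post, nothing MonoGame-specific):

```csharp
// One float4x4 world matrix = 4 float4 constants, one float4 color = 1,
// so 5 constants per instance; View + Projection add 8 shared constants.
const int MaxConstants = 4096;   // per constant buffer (D3D limit quoted above)
const int PerInstance = 4 + 1;   // world matrix + color
const int Shared = 4 + 4;        // view + projection matrices

int maxInstances = (MaxConstants - Shared) / PerInstance;   // 817
int neededFor1000 = 1000 * PerInstance + Shared;            // 5008, the number in the error
```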

If you have more instances than you can fit in your shader, just start another batch. If you can do a couple hundred instances in one draw call, you’re good.

You can probably save a little by using 4x3 world matrices; you generally only need a full 4x4 for projection matrices.

Yeah, this is also the conclusion I came to :sweat_smile: so there’s no way to store and access these values in a buffer (something similar to a VertexBuffer)? I’ve also seen some people using a Texture2D to store this kind of values without worrying about that limit, but I don’t know if it’s worth the effort…

There’s hardware instancing, where you use a second vertex buffer that holds all of the instance data. It only works for instances that use the same mesh, but that’s also an advantage, because you don’t have to build this giant vertex buffer. You mentioned Android above, and I don’t think hardware instancing is supported there.
And yes, you could also use textures to store the instance data, but again, I’m not sure vertex texture fetch works on Android; it might. I would also guess that it might be a bit slower than using constants.
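For reference, here’s a rough sketch of what hardware instancing looks like in MonoGame on desktop. The buffer names and the exact vertex layout are assumptions; the key parts are the second VertexBufferBinding with an instance frequency of 1, and DrawInstancedPrimitives:

```csharp
// Per-instance data lives in its own vertex buffer. Example layout:
// a 4x4 world matrix as four float4s, plus a per-instance color.
VertexDeclaration instanceDecl = new VertexDeclaration(
    new VertexElement(0,  VertexElementFormat.Vector4, VertexElementUsage.TextureCoordinate, 1),
    new VertexElement(16, VertexElementFormat.Vector4, VertexElementUsage.TextureCoordinate, 2),
    new VertexElement(32, VertexElementFormat.Vector4, VertexElementUsage.TextureCoordinate, 3),
    new VertexElement(48, VertexElementFormat.Vector4, VertexElementUsage.TextureCoordinate, 4),
    new VertexElement(64, VertexElementFormat.Vector4, VertexElementUsage.Color, 1));

var instanceBuffer = new VertexBuffer(graphicsDevice, instanceDecl,
    instanceCount, BufferUsage.WriteOnly);
instanceBuffer.SetData(instanceData);

graphicsDevice.SetVertexBuffers(
    new VertexBufferBinding(meshVertexBuffer, 0, 0),   // per-vertex, advances every vertex
    new VertexBufferBinding(instanceBuffer, 0, 1));    // per-instance, advances once per instance
graphicsDevice.Indices = meshIndexBuffer;
graphicsDevice.DrawInstancedPrimitives(PrimitiveType.TriangleList,
    0, 0, primitiveCount, instanceCount);
```

The shader then reads the world matrix and color from the extra vertex elements instead of from constant arrays, so the 4096-constant limit no longer applies.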

I’ve also looked into hardware instancing, but it seems a bit too “specific” to fit my needs. I have lots of different geometries, so I guess using it would result in lots of draw calls (one for each kind of geometry).
I’m working on getting the above shader fully functional, and I’m wondering: how can I handle sorting my game objects by depth? All the vertices are already in the giant VertexBuffer, so I can’t reorder them. Or maybe recreate an ordered buffer on each draw call? But I think that would lead to performance issues again…
Anyway, thanks again for your answers :slight_smile:

If your objects were all the same you could just sort the array, but since your objects are all different geometries, that’s indeed a problem. It’s for transparency, right? I would look into techniques where you don’t need to render in sorted order; see order-independent transparency.

Yes, it’s for transparency. I came across some topics about order-independent transparency back when I realized that alpha blending isn’t magic. I wanted to find a more general and flexible way to implement it, because I found ordering too restrictive (why spend that much time ordering things just for transparency?), but didn’t find any easy solution. So I will try to dig deeper into it!
EDIT: well, I remember now why I gave up on order-independent transparency: it’s way too complicated for my case (a simple 2D game, without any complex intersections between semi-transparent objects). So I need to find a proper way to sort my game objects before rendering without losing much performance, while using the kind of approach you pointed me towards! I guess it’s not going to be easy, once again :sweat_smile:

I don’t know what your game looks like, but are there really that many half-transparent objects? For full transparency you could still use the depth buffer to take care of depth sorting.

Can’t you get away with rendering all solid objects first, and all transparent objects afterwards? If there are just a few transparent objects, sorting them shouldn’t be a big problem.
And even if you don’t sort them, there’s only a problem if you have transparent objects behind other transparent objects. You can still use the depth buffer to make sure that more distant objects don’t draw on top of closer objects. The only problem left is that you have a 50% chance of a more distant object disappearing behind a closer one. Whether that’s acceptable very much depends on your game world.
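The two-pass idea could look roughly like this (DrawEntities and the entity lists are placeholders, and BlendState/DepthStencilState are the stock MonoGame states):

```csharp
// Pass 1: opaque geometry, depth test + depth write, no blending.
graphicsDevice.BlendState = BlendState.Opaque;
graphicsDevice.DepthStencilState = DepthStencilState.Default;
DrawEntities(opaqueEntities);

// Pass 2: transparent geometry, alpha blending on. Keeping depth writes on
// (as suggested above) stops farther objects drawing over closer ones, at
// the cost of the 50% "disappearing" case when they are unsorted.
graphicsDevice.BlendState = BlendState.AlphaBlend;
DrawEntities(transparentEntities);
```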

The problem is that many objects change transparency during the game (typically alpha going linearly from 0 to 1, or the other way around). So I think that maintaining two buffers (one unsorted, or sorted front-to-back, for the solid objects; one sorted back-to-front for the semi-transparent ones) will have a significant impact on performance, besides making the above solution (custom shader with a static vertex buffer) unsuitable.
Also, there is no concept of a “distant” object in my game: the camera uses a basic orthographic Projection, and the depth (z component) is used as layers to properly display the different game objects.

Vertex texture fetch doesn’t work on desktop, last I heard. I’d be surprised if it worked on Android.

Instancing is the only thing I can think of to solve this sort of problem. 500 draws, when each is a different geometry with a ton of vertices, is a lot. Even then I’m not sure how you can make that work.
I think the transparency is far less of a problem than that is.

On mobile, you mean? On desktop it’s working with shader models vs_4_0 and ps_4_0 (I just got it working; it seems to be relatively fast, but I need to do more testing tomorrow).

Well, they must have added it recently; I think I tested it on GL a couple of months ago and it didn’t work.
Instancing with VTF is about as fast as you can get, I think.
Anyway, I think you would need to provide more in-depth info to get much more help. It seems that what you have is fundamentally a complicated structural problem.

If you can draw all your objects in a single or a few draw calls, and it’s just 30K vertices in total, you could even consider this method for depth sorting:

Draw your scene in multiple depth slices. For each slice, draw your entire vertex buffer containing all the objects, but set near and far clipping planes so that only geometry in a certain depth range is drawn. That way you can build up your scene in layers, back to front.

Yes, that sounds a bit wasteful, and it is a bit wasteful, but with say 10 slices you are still only drawing 300K vertices per frame, which is still very little, at least on desktop.

Just an idea. If your vertex count or your number of draw calls gets bigger, this method becomes less and less viable.
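A sketch of that slice loop, assuming an orthographic camera whose depth runs from 0 (near) to 1 (far); effect, viewWidth, viewHeight and primitiveCount are placeholders:

```csharp
const int SliceCount = 10;

// Draw the farthest slice first, so nearer slices blend on top.
for (int slice = SliceCount - 1; slice >= 0; slice--)
{
    float near = slice / (float)SliceCount;
    float far = (slice + 1) / (float)SliceCount;

    // With an orthographic camera the projection's near/far planes
    // act as the clipping planes for this depth range.
    effect.Parameters["Projection"].SetValue(
        Matrix.CreateOrthographicOffCenter(0, viewWidth, viewHeight, 0, near, far));

    foreach (EffectPass pass in effect.CurrentTechnique.Passes)
    {
        pass.Apply();
        graphicsDevice.DrawPrimitives(PrimitiveType.TriangleList, 0, primitiveCount);
    }
}
```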

If you have not read it yet: http://realtimecollisiondetection.net/blog/?p=86

To mimic the behaviour in the blog with C#:

  1. Use an int for the sorting key. It’s up to you to define the bits, but you could use BitVector32 to help. I recommend starting with a 32-bit integer, as it’s usually the word size.

  2. Create three arrays: one for the sorting keys, one for the vertices, and one for the indices.

  3. Fill all three arrays as buffers for your draw calls.

  4. When you need to flush, use Array.Sort(Array, Array) to sort the indices by the sorting-key array, then upload the sorted indices to the GPU as an index buffer. The vertex buffer can be uploaded at the same time. But don’t sort the vertices; swapping larger structs is more CPU-intensive than swapping 16-bit or 32-bit integers.

  5. After uploading the geometry to the GPU, draw it in a single call to GraphicsDevice.DrawIndexedPrimitives.
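Steps 4 and 5 could be sketched like this. One wrinkle: each object contributes several indices, so instead of sorting the raw index array directly, this sorts an object-order array by key and then emits each object’s indices in that order (the quad assumption, buffer names, and the 4-argument DrawIndexedPrimitives overload are assumptions):

```csharp
const int IndicesPerObject = 6;   // assumption: one quad (2 triangles) per object

int[] sortKeys = new int[objectCount];    // e.g. depth/layer packed into the bits
short[] indices = new short[objectCount * IndicesPerObject];
// ... fill sortKeys and indices while batching ...

// Array.Sort(keys, items) sorts 'items' in place by 'keys'.
int[] order = new int[objectCount];
for (int i = 0; i < objectCount; i++) order[i] = i;
Array.Sort((int[])sortKeys.Clone(), order);   // clone keeps the original keys intact

// Emit indices object by object in sorted order; the vertices never move.
short[] sorted = new short[indices.Length];
for (int i = 0; i < objectCount; i++)
    Array.Copy(indices, order[i] * IndicesPerObject,
               sorted, i * IndicesPerObject, IndicesPerObject);

indexBuffer.SetData(sorted);
graphicsDevice.DrawIndexedPrimitives(PrimitiveType.TriangleList,
    0, 0, sorted.Length / 3);
```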

I propose a trick:
split your objects into batches, grouped by vertex count [e.g. 4-100, 100-200, 200-300]
in each batch, pad every object’s vertices up to the biggest count [yes, it’s wasteful, but considering your 30K vertices it doesn’t hurt that much]
render each batch using an index buffer [which simply skips the padding vertices]
use the vertex id in the shader to get the object id [e.g. id = vertexId / biggestVertexCountInBatch]
for transparency, since you use an orthographic projection, pull each object towards the camera by a very small amount based on its id and batch; that way objects drawn later end up closer to the camera than earlier ones [add depth split layers here if needed]
done…

But with lots of transparency it will kill your fill rate anyway.

I can try. All game objects derive from the abstract base class Entity (simplified version below):

public abstract class Entity
{
    public Vector2 Size { get; }
    public Vector2 Center { get; }
    public float Angle  { get; }
    public float Depth { get; }
    public Vector2 Scale { get; }
    public float Alpha { get; }
    public Entity HookedOn { get; }
    public Vector2 HookOffset { get; }

    public virtual void Update()
    {
    }

    public virtual void Draw(DrawBatch drawBatch)
    {
    }
}

HookedOn and HookOffset work like this: if Entity2 is hooked to Entity1 (Entity2.HookedOn == Entity1), then
its position should be Entity1.Center + Entity2.HookOffset + Entity2.Center (kind of a parent/child relationship).
Then I also have VertexEntity, derived from Entity, which defines some geometry (a Vector2[] for vertices and a short[] for indices). VertexEntity overrides Draw to submit its geometry to the graphics card (for now it’s a call to a DrawBatch, which does the sort-by-depth-then-draw-vertices once every object has been batched).
By composing Entities, I can build my game scene. But it seems like it’s not performing well, hence this topic :slight_smile: I’m trying to find a generic way to make this kind of architecture run as fast as possible (even though I’m not working with tons of game objects or vertices).
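For what it’s worth, the hook chain described above can be resolved with a small helper like this (a sketch based on that description, assuming HookOffset is relative to the parent’s Center; GetWorldCenter is a hypothetical name):

```csharp
// Walk up the HookedOn chain, accumulating each offset and parent center.
public static Vector2 GetWorldCenter(Entity entity)
{
    Vector2 position = entity.Center;
    for (Entity e = entity; e.HookedOn != null; e = e.HookedOn)
        position += e.HookOffset + e.HookedOn.Center;
    return position;
}
```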

Yeah, I could do that, but the depth-sorting problem can still occur within a single slice, and I have way too many layers to go for 1 layer = 1 slice. I would prefer a “cleaner” solution :sweat_smile:

Yeah, I thought about doing that: instead of sorting the vertices, just sort the indices while keeping the same vertex buffer. It will still require updating the IndexBuffer on every draw call, which may have a significant impact on performance, but since it’s an array of shorts in my case, it might be acceptable. I will dig deeper into this approach, thanks for the suggestion!

I didn’t understand everything in your suggestion :sweat_smile:, but I will take a closer look at it if needed, thank you!

After some research and trial and error, it seems that fetching textures from the vertex shader is not supported by MonoGame on OpenGL (and therefore doesn’t work on Android). Is there any other way to pass a large amount of data to a shader? I can’t seem to find one…

If you can’t use hardware instancing or VTF, shader parameters are your only option.
I don’t see why the constant limit is such a big problem for you. I understand that having to start a new batch after running out of parameter space is a bit inconvenient, but it’s not the end of the world either.

No, sure, it’s not the end of the world, but before going down that route I wanted to make sure there’s no more convenient way to achieve what I want :slight_smile: