Please help me improve the performance of my shadowing shader


I think my lighting and shadowing shader could use some improvements. It “works” fine, as in the picture is good, but the performance is bad. Even a single DrawIndexedPrimitives call takes a lot of time. I cobbled this shader together from tutorials found online and translated some parts to “MonoGame HLSL”. A lot of it is probably far from optimal, and maybe your eyes will bleed reading it, but if anyone would be so kind as to point out obvious ways to improve it, that would help a lot!


Thank you!

Looks fine.

The shadowing is 16 checks per fragment/pixel (actually it should be 16, but your loop condition is <15 instead of <16 or <=15, so it only takes 15 samples, while you still divide by 16 afterwards). This shouldn’t be abnormally expensive, but you can try setting it to 4 samples, for example, and check whether performance improves a lot.

If that doesn’t help, it’s most likely not the shader.

But one thing:

In your content pipeline make sure that your models have GenerateTangentFrames enabled, otherwise your game will lag like hell.

So, it turns out that the problem is in my shadow map creation.

In the two shadow maps below, the one on the bottom renders more than twice as fast (same geometry, same resolution; the only difference is that I widened my orthographic projection on the second one to include the coin at the bottom).

When I do that, the map renders a lot faster. Is it because less of the screen (the texture, actually) is taken up by my geometry? I.e., is this expected behaviour?

Thank you

Yes, more pixels = more cost.

However, the shader should be really cheap, only a few lines, right?

Does the geometry in the middle have an excessive amount of polygons? (More than 1 Million)

And how bad is performance really with/without shadows?

The shader for generating the shadow map is indeed extremely simple:

    #if OPENGL
        #define SV_POSITION POSITION
        #define VS_SHADERMODEL vs_3_0
        #define PS_SHADERMODEL ps_3_0
    #else
        #define VS_SHADERMODEL vs_4_0_level_9_3
        #define PS_SHADERMODEL ps_4_0_level_9_3
    #endif

    matrix WorldViewProjection;

    struct InstancingVSinput
    {
        float4 Position : POSITION0;
    };

    struct InstancingVSoutput
    {
        float4 Position : POSITION0;
        float Depth : TEXCOORD0;
    };

    InstancingVSoutput InstancingVS(InstancingVSinput input, float4 HighlightColor : TEXCOORD1, float4x4 World : TEXCOORD2)
    {
        InstancingVSoutput output;

        float4 pos = mul(mul(input.Position, World), WorldViewProjection);

        output.Position = pos;
        output.Depth = pos.z / pos.w;

        return output;
    }

    float4 InstancingPS(InstancingVSoutput input) : COLOR0
    {
        return float4(input.Depth, input.Depth, input.Depth, 1);
    }

    technique Instancing
    {
        pass P0
        {
            VertexShader = compile VS_SHADERMODEL InstancingVS();
            PixelShader = compile PS_SHADERMODEL InstancingPS();
        }
    }

I found out that one of my problems was a spritebatch.Begin() call setting the blend state to AlphaBlend. My guess is that it was messing with the culling of my objects, because when the blend state is set to Opaque, the shadow map generates about twice as fast.
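For anyone hitting the same thing: with its default arguments, SpriteBatch leaves the device with BlendState.AlphaBlend, DepthStencilState.None, RasterizerState.CullCounterClockwise and SamplerState.LinearClamp. The disabled depth test may matter at least as much as the blending here, although that is only a guess. Below is a minimal sketch of resetting the state before the shadow pass, assuming the render target was created with a depth buffer. ShadowPassState and its Begin method are hypothetical names, not part of MonoGame.

```csharp
using Microsoft.Xna.Framework;
using Microsoft.Xna.Framework.Graphics;

public static class ShadowPassState
{
    // Hypothetical helper: call it right before drawing the shadow casters.
    // Assumes shadowMap was created with a depth format (e.g. DepthFormat.Depth24).
    public static void Begin(GraphicsDevice device, RenderTarget2D shadowMap)
    {
        device.SetRenderTarget(shadowMap);

        // White = depth 1.0 in the red channel, i.e. "nothing closer than the far plane".
        device.Clear(ClearOptions.Target | ClearOptions.DepthBuffer, Color.White, 1.0f, 0);

        // Undo whatever an earlier SpriteBatch draw left behind.
        device.BlendState = BlendState.Opaque;
        device.DepthStencilState = DepthStencilState.Default;
        device.RasterizerState = RasterizerState.CullCounterClockwise;
    }
}
```

Calling something like ShadowPassState.Begin(GraphicsDevice, shadowMap) at the top of the shadow pass keeps that pass independent of whatever was drawn before it.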

As of now, I am sending 150k triangles to the GPU using instancing, rendering into a 2048x2048 texture (SurfaceFormat = Single).

My GPU manages to draw them in 4.6ms in the “worst case” scenario (most of my shadow map texture is filled).

Does that sound about right, or should I look into it further?

I don’t know your computer or how long the whole frame takes to render, so it’s hard to say whether 4.6ms is fair, but it’s probably alright.

Most games have a big cut of their rendering time spent on shadows, 50% and more is not uncommon if you have realtime shadows.

I’ve been doing a little bit of reading on culling, and I found another big optimization, which I am sharing here :)

Pre-sorting my objects by distance to the shadow map camera (before creating the instance vertex buffer) makes the shadow map rendering time go from 4.6ms to 1.8ms.

Such a big improvement is probably specific to my geometry (lots of objects are occluded)

I was rather surprised by this; my understanding was that one of the advantages of using hardware instancing was that all the occlusion culling stuff happened on the GPU. I guess it cannot hurt to help it by pre-sorting objects, since the hit on my update method is nonexistent.

This means I could probably optimize my main drawing function by sorting my objects by distance to the real camera. However, that would mean having two different vertex buffers for instances, but that’s probably OK.
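A likely explanation for the gain is early depth testing: when nearer objects are drawn first, the GPU can reject the hidden pixels of farther objects before running the pixel shader. Here is a rough sketch of the sort, assuming each instance is represented only by its world matrix (a real instance struct that also carries HighlightColor would be sorted the same way). InstanceSorting and SortFrontToBack are made-up names, and the sort key is depth along the view direction rather than straight-line distance, which seems a better fit for an orthographic light.

```csharp
using System;
using Microsoft.Xna.Framework;

public static class InstanceSorting
{
    // Hypothetical helper: reorders instance world matrices so the ones
    // nearest to the viewer (light or camera) come first.
    public static void SortFrontToBack(Matrix[] worlds, Vector3 viewPosition, Vector3 viewDirection)
    {
        viewDirection.Normalize();

        // Depth of each instance origin along the view direction.
        var depths = new float[worlds.Length];
        for (int i = 0; i < worlds.Length; i++)
        {
            depths[i] = Vector3.Dot(worlds[i].Translation - viewPosition, viewDirection);
        }

        // Sorts 'depths' ascending and applies the same reordering to 'worlds'.
        Array.Sort(depths, worlds);
    }
}
```

The same helper could be called once with the light's position and direction before filling the shadow pass instance buffer, and once with the real camera's before filling the main pass buffer. The depths array is allocated on every call here; caching it would avoid per-frame garbage.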

Thank you for your help!