I was wondering which solution is better: branching with an “if” for things like alpha test or depth peeling, or defining a technique for each case? With techniques, the “if” will be on the CPU when setting the currenttechnique anyway…
How well do today’s GPUs handle an “if”?
Ifs are only as good, or as bad, as their else. By that I mean:
Basically the GPU can end up executing both the if and the else, for example when the branch is flattened or when neighbouring pixels take different paths.
It then figures it all out at the end; you can imagine how that could bottleneck.
Ifs are bad for shader code because they sort of break the parallel processing the GPU does.
You can get away with a couple, but too many and you’re pushing it.
It’s not easy, or always possible, to design functions that evaluate the if/else through straight math, but sometimes it is, for instance with polynomials or rounding.
This is sort of the idea behind multi-texturing or texture splatting, but used in the context of a switch.
For example, say you take floor( a*b ), where a and b are values from 0 to 1. The result can only be 1 if both a and b are 1; otherwise, rounding down, it will be 0. In this context you get a value that is 1 or 0, which symbolizes true or false.
If you were to add pixel colors from texture A multiplied by the result, and from texture B multiplied by one minus the result, you would get pixels from texture A or B but never both. This is sometimes useful, as in the case of multi-texturing, but most often it’s not going to help.
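To make the mask idea concrete, here is a minimal HLSL sketch. The sampler and function names are made up for illustration, and it uses floor rather than round, since floor(a*b) is 1 only when both a and b are exactly 1.

```hlsl
// Hypothetical samplers, named for illustration only.
sampler2D SamplerA;
sampler2D SamplerB;

// a and b are assumed to be in the range [0, 1].
float4 SelectByMask(float2 uv, float a, float b)
{
    // 1 when a == 1 and b == 1, otherwise 0.
    float mask = floor(a * b);

    // Both textures are always sampled; no branch is involved.
    float4 colorA = tex2D(SamplerA, uv);
    float4 colorB = tex2D(SamplerB, uv);

    // mask == 1 returns colorA, mask == 0 returns colorB.
    return colorA * mask + colorB * (1.0 - mask);
}
```

The last line is the same as lerp(colorB, colorA, mask). The price is that both textures are fetched for every pixel, whichever one ends up being used.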
Just put a [branch] in front and make sure there are no gradient operations inside. So you should not use samplers, or at least have the sample gradient predefined if you want to read textures.
Same goes for for loops with early exits - don’t forget the [loop] in front. I found that to have major fps benefits, the compiler seems to never do it without these hints.
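A rough sketch of both hints, assuming hypothetical sampler and parameter names and a shader model that supports dynamic flow control (ps_3_0 or later). The gradients are taken outside the branch and passed to tex2Dgrad, and the loop reads with tex2Dlod, so nothing inside the flow control needs implicit derivatives.

```hlsl
// Hypothetical samplers and parameter, named for illustration only.
sampler2D DiffuseSampler;
sampler2D DetailSampler;
sampler2D DepthSampler;
bool UseDetail;

float4 ShadeWithDetail(float2 uv)
{
    float4 color = tex2D(DiffuseSampler, uv);

    // Derivatives are computed before the branch...
    float2 dx = ddx(uv);
    float2 dy = ddy(uv);

    [branch]
    if (UseDetail)
    {
        // ...so the read inside it uses explicit gradients.
        color *= tex2Dgrad(DetailSampler, uv, dx, dy);
    }

    return color;
}

// Returns 1 if any of up to 32 samples along stepUv is closer than threshold.
float FindCloserSample(float2 uv, float2 stepUv, float threshold)
{
    float found = 0.0;

    [loop]
    for (int i = 0; i < 32; i++)
    {
        // tex2Dlod takes the mip level in w, so no derivatives are needed.
        float depth = tex2Dlod(DepthSampler, float4(uv + stepUv * i, 0, 0)).r;

        if (depth < threshold)
        {
            found = 1.0;
            break; // early exit
        }
    }

    return found;
}
```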
The new DOOM, for example, relies heavily on this instead of more shader permutations; they touched on the subject in their engine presentations. So it’s clear it’s not a deal breaker with modern GPUs.
I used to avoid it but through personal testing it became clear that the penalties are very minor in comparison to the gains in comfort.
But in terms of raw performance a new technique is faster, if only slightly.
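For comparison, the per-technique version could look roughly like this. It is only a sketch in FX-style effect syntax; the sampler, parameter, and technique names and the ps_3_0 profile are assumptions, so adjust them to your setup.

```hlsl
// Hypothetical effect-file fragment; all names and the profile are assumptions.
sampler2D DiffuseSampler;
float AlphaThreshold;

float4 PS_Opaque(float2 uv : TEXCOORD0) : COLOR0
{
    return tex2D(DiffuseSampler, uv);
}

float4 PS_AlphaTest(float2 uv : TEXCOORD0) : COLOR0
{
    float4 color = tex2D(DiffuseSampler, uv);

    // Discard the pixel when its alpha is below the threshold.
    clip(color.a - AlphaThreshold);

    return color;
}

// One technique per case; neither shader contains an if.
technique Opaque
{
    pass P0 { PixelShader = compile ps_3_0 PS_Opaque(); }
}

technique AlphaTest
{
    pass P0 { PixelShader = compile ps_3_0 PS_AlphaTest(); }
}
```

The choice between the two then happens once per draw on the CPU, when the current technique is set, instead of once per pixel on the GPU.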
I’ve made some small benchmarks, and it appears the gain depends on the graphics card:
On my rig, with an NVIDIA 670M, techniques are faster (about 5/8 fps, for this test shader only).
On a friend’s rig, with an NVIDIA 1080, techniques and branches are equal (but my shader is not the most intensive one for a graphics card).
So @kosmonautgames is right, it is comfortable on newer cards. But for an old one like mine, it’s not an option, as I want to have a polished engine that can go as fast as possible on old machines too. After all, I’m not Crytek, pushing people to buy new computers to play ^^ The more gamers who can run the game and are happy with its speed, the more money…
I’ll do some further tests later when I finish my PBR implementation.
The irony is that Crytek uses the old “make a new technique for each shader permutation” approach, resulting in potentially thousands of different shaders (they are generated for the materials, not combined by hand, obviously). Their main rendering engineer, Tiago Sousa, left Crytek to work at id Software (DOOM) and just went with dynamic branching this time.
Sure, I was thinking about FarCry when I wrote that. It all started from a tech demo to promote NVIDIA’s new cards (I don’t remember which ones though, I’m older than yesterday and less than tomorrow ^^). It then became the game, pushing players to buy the newest cards.
Is it better to use [loop] or [unroll] (for low iteration counts, under 20 for example)?
If you have an NVIDIA 10 series, yes.
Not tried with AMD though.
But it would be more accurate if tested with an intensive shader, mine was just a DepthOfField effect.
*with [branch] (if it’s more than trivial instructions)
The default compiler will [flatten] everything it seems. Sad!
20 is not a low value lol. Well, if your for loop has no early out you might as well unroll; otherwise just loop every time. It should be noted that unrolling will take the compile time waaay up if the loop body is expensive.
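As a small illustration of the “no early out, so unroll” case, here is a sketch of a fixed-count blur. The sampler name, the function name, and the sample count of 8 are all made up for the example.

```hlsl
// Hypothetical sampler and sample count, for illustration only.
sampler2D SceneSampler;
static const int SampleCount = 8;

float4 BlurHorizontal(float2 uv, float texelWidth)
{
    float4 sum = float4(0, 0, 0, 0);

    // Fixed trip count and no early out, so unrolling is an option.
    [unroll]
    for (int i = 0; i < SampleCount; i++)
    {
        // Center the taps around uv.
        float offset = (i - (SampleCount - 1) * 0.5) * texelWidth;
        sum += tex2D(SceneSampler, uv + float2(offset, 0));
    }

    // Average of all taps.
    return sum / SampleCount;
}
```

With a break inside the loop, [loop] would be the hint to try instead.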
I’ve never had the need of nested loops. I avoid them by flattening the 2D array and I’ve never had the need in a shader of a 2D loop for now.
But 20 is low compared to SSAO when you set 64 samples in the loop, or to some other algorithms that need 1024.
I use unroll everywhere as I’m under 32, and it is not so slow to compile. The inner code in the loop is not so big.
64 samples for default SSAO is a lot, but probably in line for stuff like HBAO. Then again, we are probably not integrating and blurring as well as the pros. I’ve read the Infinity Ward article on their latest AO, which is probably the best in the business, and they sample in only one direction per pixel and get impressive results regardless.
Not me, but in a mathematics algorithm in a book at work. I don’t remember which algorithm, but I remember it shocked me, as it seemed an unreasonable amount. I was in a training course and the rigs were using Quadro graphics cards…