How to share data between threads/cores in a compute shader.

Optimistic_Peach · June 19, 2018, 12:35am

Hi there, I’m trying to implement a compute shader that acts as a sorting algorithm. I need to share a lot of data between the cores/threads (i.e. no matter which instance of the function I’m running, it has access to the same data as the rest). I started with a setup like this:

RWStructuredBuffer<uint> SharedGlobalParams : register(u0);
RWStructuredBuffer<uint> InWeights : register(u1);
RWStructuredBuffer<uint> OutIndexes : register(u2);
RWStructuredBuffer<bool> SyncSuccesses : register(u3);

This will prove hard, though, since I would have to initialize each one in C# with an appropriate size (correct me if I’m wrong here, please). How would I achieve this? I’ve read about static, shared and groupshared, but I’m not sure which one to use.

Here is the signature of my main function:

[numthreads(1, 1, 1)]
void CSMain(uint3 threadID : SV_DispatchThreadID);

Any help is appreciated!

AcidFaucent · June 22, 2018, 2:18pm

Your buffers are the only global memory you get that is shared across all invocations.

You sure about shared? That’s an Effect format thing, not a DirectCompute thing unless there’s something new.

groupshared is for sharing among invocations within the same thread group. [numthreads(1, 1, 1)] makes that pointless, because each group then contains a single thread; Dispatch takes group counts (groupsX, groupsY, groupsZ), not thread counts.

This will prove hard, though, since I would have to initialize each one in C# with an appropriate size (correct me if I’m wrong here, please).

They just need to be large enough and you can pass off some additional info for the ranges.
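
As a rough sketch of that idea: allocate the buffers generously and pass the real element count separately. The cbuffer name `SortParams`, its field `ElementCount`, and the group size of 64 are made up for illustration; they are not from the original project.

```hlsl
// Sketch only: SortParams and ElementCount are hypothetical names.
cbuffer SortParams : register(b0)
{
    uint ElementCount; // number of valid entries in InWeights / OutIndexes
};

RWStructuredBuffer<uint> InWeights  : register(u1);
RWStructuredBuffer<uint> OutIndexes : register(u2);

[numthreads(64, 1, 1)]
void CSMain(uint3 threadID : SV_DispatchThreadID)
{
    // The buffers may be larger than the data set; skip the unused tail.
    if (threadID.x >= ElementCount)
        return;

    // Start with the identity permutation; a sort would reorder these.
    OutIndexes[threadID.x] = threadID.x;
}
```

With this layout the C# side only has to make sure each buffer holds at least ElementCount elements and that enough groups are dispatched to cover them.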

Optimistic_Peach · June 23, 2018, 7:43pm

Thanks a lot, @AcidFaucent. I will keep using a RWStructuredBuffer then.
One more question: I need to keep all my functions synced (between all the threads and thread groups). Should I write a function to do this, or is there some kind of hidden function for it? I’ve only found functions that sync within a thread group (useless for me because I use numthreads(1, 1, 1)).

AcidFaucent · June 24, 2018, 6:37pm

Not entirely sure what you mean. You might need to clarify that.

If you’re worried about async compute or multiple dispatches causing issues, it isn’t magical. You only need to worry about that if you deliberately use async.

(Between all the threads and thread groups.)

You can’t do this. You can only emulate it by queueing up a chain of dispatches, doing whatever sync you need in intermediate kernels based on the results of the previous ones.

If you aren’t familiar with it, you should probably look into how prefix sums are used to compact buffers. That technique runs into these sorts of problems, and it’s a building block used all the time.

All synchronization is group-based; there are no global syncs other than when the compute shader returns. Even then, all you can really synchronize is memory access.
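
To make the "chain of dispatches" idea concrete, here is a minimal sketch of one pass of an odd-even transposition sort. That algorithm is swapped in purely for illustration (it is not necessarily what the original poster was writing), and `PassParams`, `ElementCount`, `Phase` and `Keys` are hypothetical names. Every thread in a pass touches its own pair, so nothing inside the kernel needs a cross-group sync; the only global sync is the end of each dispatch.

```hlsl
// Sketch only: all names here are hypothetical.
cbuffer PassParams : register(b0)
{
    uint ElementCount; // number of keys to sort
    uint Phase;        // 0: pairs (0,1),(2,3),...  1: pairs (1,2),(3,4),...
};

RWStructuredBuffer<uint> Keys : register(u0);

[numthreads(64, 1, 1)]
void SortPass(uint3 threadID : SV_DispatchThreadID)
{
    // Each thread owns exactly one pair in this pass, so pairs never overlap.
    uint left  = threadID.x * 2 + Phase;
    uint right = left + 1;
    if (right >= ElementCount)
        return;

    uint a = Keys[left];
    uint b = Keys[right];
    if (a > b)
    {
        // Out of order: swap the two neighbours.
        Keys[left]  = b;
        Keys[right] = a;
    }
}
```

The host would run this kernel ElementCount times, alternating Phase between 0 and 1 before each Dispatch. It is far from the fastest GPU sort, but it shows how the loop moves out of the shader and into the sequence of dispatches.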

If you have something like a coalesce that needs to happen for each group before you can move on to the next compute shader (or continue in the current one), it’s fairly common to block and do that work on the first thread:

__appropriate_barrier__ (based on prior accesses)
if this_thread_ID == first_group_thread_ID then
    perform coalesce work
endif
__appropriate_barrier__ (based on accesses in the if, if more work is to be done in same kernel)
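
In HLSL the pseudocode above could look roughly like this. It is a sketch under assumptions: the "coalesce" is taken to be a per-group sum, `GROUP_SIZE`, `scratch` and `GroupTotals` are hypothetical names, and the dispatch is assumed to cover the buffer in whole groups of 64.

```hlsl
#define GROUP_SIZE 64

RWStructuredBuffer<uint> InWeights   : register(u1);
RWStructuredBuffer<uint> GroupTotals : register(u2); // hypothetical: one entry per group

// Visible to every thread in the same group, and only to that group.
groupshared uint scratch[GROUP_SIZE];

[numthreads(GROUP_SIZE, 1, 1)]
void CSMain(uint3 dispatchID : SV_DispatchThreadID,
            uint3 groupID    : SV_GroupID,
            uint  groupIndex : SV_GroupIndex)
{
    // Every thread stages one value in group-shared memory.
    scratch[groupIndex] = InWeights[dispatchID.x];

    // Wait until all threads in this group have written their slot.
    GroupMemoryBarrierWithGroupSync();

    if (groupIndex == 0)
    {
        // Only the first thread of the group performs the coalesce.
        uint total = 0;
        for (uint i = 0; i < GROUP_SIZE; ++i)
            total += scratch[i];
        GroupTotals[groupID.x] = total;
    }

    // If more work followed here that depended on thread 0's result,
    // a second barrier matching those accesses would be needed first.
}
```

Note that this needs more than one thread per group to be useful; with numthreads(1, 1, 1) there is nothing for the barrier to wait on.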

Optimistic_Peach · June 25, 2018, 9:48pm

As in, I have multiple cores running a while loop, and I need them all to perform the last action in the loop at the same time, so that branching doesn’t affect when the code runs. I tried this (simplified version):

void Sync()
{
    CoresSyncedParam += 1;
    while (CoresSyncedParam != CoresNumberParam)
    {
    }
    CoresSyncedParam = 0;
}

But this resulted in DXGI_ERROR_DEVICE_REMOVED with a DeviceRemovedReason HRESULT of 0x887A0006 (I already looked this up and it’s DXGI_ERROR_DEVICE_HUNG). As I understand it, this means my code took too long to run. When I removed my Sync calls, it worked, but the algorithm didn’t behave as intended because it wasn’t synced.

How could I achieve this or is there a better way to do this?


kosmonautgames · December 4, 2018, 10:58pm

Sorry for the necro, but how did you guys manage to make compute shaders work? Any samples?

KakCAT · December 4, 2018, 11:29pm

Hi Kosmonaut, there’s an example here:

Optimistic_Peach · December 5, 2018, 12:02am

Sorry, @kosmonautgames, but I haven’t worked with MonoGame, or C# for that matter, for a few months; I’ve been working with Rust instead. From what I can recall, and from looking into my old project (which is a mess, by the way; I wouldn’t recommend going in there for ideas), I used GraphicsDevice.Handle and cast that to a D3D11.Device to access the inner device and use DirectX directly.

What it does provide though, is a list of libraries which I found that worked well with the version of MonoGame I was using back then.

Optimistic_Peach · December 5, 2018, 12:56am

Oh, I just found this which was very useful while working on another project, and with the GraphicsDevice.Handle from my previous post, you can make it work.