Fastest way to draw an array of bytes

I have a pointer to an array of bytes that is populated by a C++ application (Snes9x). Basically, what I need is to read that array and render it to the screen using MonoGame. The array points to Snes9x's backbuffer and it's (at least) 256 * 224 * 2 bytes big (width * height * bytes per pixel).

My first idea was to convert that array into a Texture2D and draw it to the screen, but I think this would be very slow since I'd have to update the texture 60 times per second. Plus, I'd also have to convert the 16bpp pixel format to RGB before updating the texture.

Is there a better way to achieve this?

As long as the image you want to draw does not originate from the GPU, you have to send it 60 times per second no matter what.

A Texture2D is not limited to the default 32-bit Color format. Check the constructors: there is one which takes a SurfaceFormat, but I can't say whether any of the available formats fits your data.

But to be honest, 256x224 shouldn't be that big of a deal, especially as you're dealing with flat arrays anyway. I would just try doing it the regular way, updating the texture every frame.

The problem is that modern GPU architecture is very different from the old ways. On something like the NES, the hardware had direct access to each pixel. On modern GPUs, textures are stored in graphics card memory and handled by the GPU from a small number of CPU instructions, so there is normally no need to tell each pixel what it has to display. So if you use a modern GPU, you will have to use slow processes like setting each pixel every frame. I believe this is not MonoGame specific; it is the architecture of modern graphics.

I tried it and it's not good: just converting the array to a Texture2D and rendering it is taking 15% of my CPU (Core i5).

Converting from 16bpp to RGB inside the shader might help. I’m also considering using a VBO or some other way to send data to the GPU.

You can pass an insane amount of data per frame between CPU and GPU; that bus is insanely fast these days. We have entire techniques based on it, for example Virtual Texturing. It is not a very slow process, it is an extremely fast process. It is not as fast as GPU → GPU (hence it is considered a bottleneck in some cases), but it is still very fast, with very high bandwidth. A 256 * 224 * 2 byte frame is about 112 KB, or roughly 6.9 MB per second at 60 FPS, which is absolutely nothing; we can do way more. We also have OS-level features that are built for heavy traffic to the GPU, such as DirectStorage from Microsoft. Not that it is applicable in this case; I am just pointing out why it is wrong to think that modern architectures aren't built for high-bandwidth CPU → GPU traffic.

Additionally, as a future optimization, we can create buffers with the Dynamic usage flag (in DirectX; OpenGL has its own variant). We sacrifice some GPU read speed for CPU write speed, which is useful for a buffer we need to write to multiple times per update. It is common for constant buffers, for example.

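Use SurfaceFormat.Bgr565 when creating the texture.

// Constructor overload that takes a SurfaceFormat:
// public Texture2D(GraphicsDevice graphicsDevice, int width, int height, bool mipmap, SurfaceFormat format)
var texture = new Texture2D(GraphicsDevice, 256, 224, false, SurfaceFormat.Bgr565);
// then call SetData() every frame:
texture.SetData<ushort>(data); // where data is a ushort[] holding 256 * 224 pixels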

No conversion would be required.

See the following link for changing the format if needed:

CPU utilization is not a good indicator. If you turn off IsFixedTimeStep, your utilization will be (almost) 100% (minus IO wait etc.), no matter what you do.

So with a fixed 60 FPS you have about 16 ms per frame, and you can basically use that budget however your game needs it. Copying a flat array into an array of double the size isn't really that expensive, so maybe you have a bottleneck somewhere else?

Unfortunately I have no idea how the 2 bytes of the SNES pixel are encoded to hold 3 values, but there is a SurfaceFormat for a single 16-bit channel and one for 2 x 8 bits. Either should at least spare you a conversion on the CPU side, so you can do it on the GPU instead.

edit: oh someone already found a good SurfaceFormat, so never mind :slight_smile:
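
For reference, here is a rough sketch of how three channels usually fit into 16 bits, in case a CPU-side conversion is ever needed. It assumes the buffer is RGB565 (red in the top 5 bits, green in the middle 6, blue in the low 5); Snes9x can be configured for other layouts, so check the actual pixel format first. PixelConvert and its method names are made up for illustration, and the output packing (red in the lowest byte) is an assumption about what the target texture expects.

```csharp
using System;

public static class PixelConvert
{
    // Assumption: input is RGB565 (bits 15-11 red, 10-5 green, 4-0 blue).
    // Output is packed as 0xAABBGGRR, i.e. bytes R, G, B, A in memory
    // on a little-endian machine, with alpha forced to 255.
    public static uint Rgb565ToRgba8888(ushort p)
    {
        uint r5 = (uint)(p >> 11) & 0x1F;
        uint g6 = (uint)(p >> 5) & 0x3F;
        uint b5 = (uint)p & 0x1F;

        // Replicate the high bits into the low bits so that the
        // maximum 5/6-bit value maps to exactly 255.
        uint r = (r5 << 3) | (r5 >> 2);
        uint g = (g6 << 2) | (g6 >> 4);
        uint b = (b5 << 3) | (b5 >> 2);

        return 0xFF000000u | (b << 16) | (g << 8) | r;
    }

    // Converts a whole frame; dst must be at least as long as src.
    public static void ConvertFrame(ushort[] src, uint[] dst)
    {
        if (dst.Length < src.Length)
            throw new ArgumentException("dst is too small", nameof(dst));

        for (int i = 0; i < src.Length; i++)
            dst[i] = Rgb565ToRgba8888(src[i]);
    }
}
```

With a matching 16-bit SurfaceFormat none of this is needed, which is why that route is the cheaper one.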

C# managed array access is quite slow due to index validation and dereferencing for each access.

For faster operations pin both arrays in memory using an unsafe context and use pointer arithmetic to read, modify, and write the memory contents.
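
A minimal sketch of the pinning idea, assuming the emulator hands you the backbuffer as an IntPtr and that it holds at least as many 16-bit pixels as the destination array. FrameCopy and CopyFrame are hypothetical names, and the project has to be compiled with unsafe code enabled (AllowUnsafeBlocks).

```csharp
using System;

public static class FrameCopy
{
    // Copies destination.Length 16-bit pixels from a native buffer into a
    // managed array using raw pointers, so no per-element bounds checks run.
    // Assumption: 'source' points to at least destination.Length pixels.
    public static unsafe void CopyFrame(IntPtr source, ushort[] destination)
    {
        ushort* src = (ushort*)source;

        // 'fixed' pins the managed array so the GC cannot move it
        // while we hold a raw pointer into it.
        fixed (ushort* dstStart = destination)
        {
            ushort* dst = dstStart;
            ushort* end = dstStart + destination.Length;

            while (dst < end)
                *dst++ = *src++;
        }
    }
}
```

The native buffer does not need pinning because the GC does not manage it. The filled array can then go straight into texture.SetData<ushort>(...).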

You could also use Parallel.For() to spread the work across multiple cores. This is thread safe as long as each iteration only reads and writes its own indices.
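
Here is a rough sketch of the Parallel.For() variant, splitting the frame by rows so that each iteration touches only its own slice of both arrays. It assumes an RGB565 source and a 32-bit destination packed as 0xAABBGGRR; ParallelConvert and ConvertRows are illustrative names. Whether the threading overhead pays off for a frame this small is something to measure.

```csharp
using System;
using System.Threading.Tasks;

public static class ParallelConvert
{
    // Converts an RGB565 frame to 32-bit pixels, one row per iteration.
    // Rows never overlap, so no locking is needed.
    public static void ConvertRows(ushort[] src, uint[] dst, int width, int height)
    {
        if (src.Length < width * height || dst.Length < width * height)
            throw new ArgumentException("buffers are smaller than width * height");

        Parallel.For(0, height, y =>
        {
            int rowStart = y * width;
            for (int x = 0; x < width; x++)
            {
                ushort p = src[rowStart + x];
                uint r5 = (uint)(p >> 11) & 0x1F;
                uint g6 = (uint)(p >> 5) & 0x3F;
                uint b5 = (uint)p & 0x1F;

                // Expand each channel to 8 bits by bit replication.
                uint r = (r5 << 3) | (r5 >> 2);
                uint g = (g6 << 2) | (g6 >> 4);
                uint b = (b5 << 3) | (b5 >> 2);

                dst[rowStart + x] = 0xFF000000u | (b << 16) | (g << 8) | r;
            }
        });
    }
}
```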

Thanks, that worked nicely and it’s not consuming much CPU.