Any reason why Vector2 was deliberately coded slower?

BlackGoose · March 1, 2023, 9:32am

I guess this might be ancient and was never changed by the MonoGame team. While playing around with performance profiling, I noticed that my custom struct, which does essentially the same thing as Vector2, performed faster than Vector2 when calling the operators +, -, *, /. I checked with System.Numerics.Vector2, which had similar performance to my own struct.

I took a look at the source code and found that Microsoft.Xna.Framework.Vector2, for some reason, creates 2 copies on every + operation by calling += on it (if I’m not mistaken, I think that’s what happens when you try to change a value on a struct) instead of just creating one with new(x1 + x2, y1 + y2).

Microsoft.Xna.Framework.Vector2

        public static Vector2 operator +(Vector2 value1, Vector2 value2)
        {
            value1.X += value2.X;
            value1.Y += value2.Y;
            return value1;
        }

System.Numerics.Vector2

        public static Vector2 operator +(Vector2 left, Vector2 right)
        {
            return new Vector2(left.X + right.X, left.Y + right.Y);
        }

My custom Size struct

		public static Size operator +(Size s1, Size s2)
		{
			return new Size(s1.Width + s2.Width, s1.Height + s2.Height);
		}

Same thing with all other operators. My performance profiling code (using BenchmarkDotNet) just calls +, *, -, / a million times (and, granted, creates new structs, but I doubt that’s relevant to the actual difference, given that the constructor of every struct just assigns 2 field values, i.e. does the same thing in all 3 test cases).

		[Benchmark]
		public void MicrosoftXnaVector2()
		{
			var x = new Microsoft.Xna.Framework.Vector2(0, 0);
			for (var i = 1; i < it; i++)
			{
				x += new Microsoft.Xna.Framework.Vector2(i, i);
				x *= new Microsoft.Xna.Framework.Vector2(i, i);
				x -= new Microsoft.Xna.Framework.Vector2(i, i);
				x /= new Microsoft.Xna.Framework.Vector2(i, i);
			}
		}

BlackGoose · March 1, 2023, 3:10pm

Tested it with a copy of my custom struct (SizeWorse) which uses the same approach as Xna.Vector2. The += does seem to be the issue. Certainly not noticeable at all in most scenarios, but it would be an easy change to make and doesn’t sacrifice anything.

MysticRiverGames · March 1, 2023, 3:41pm

How many times do you run the loop?
For high-performance games with lots of things going on, this can be critical. I released a game that updates 1000s of enemies, bullets and things like that, and I had to squeeze out every little bit of juice I could get so it doesn’t slow down, and I am using a lot of Vector2. Maybe I can squeeze out more juice by changing V2.

BlackGoose · March 1, 2023, 6:07pm

The loop you see in MicrosoftXnaVector2() is run 1 million times. Note that each iteration does 4 operations on Vector2, i.e. add, multiply, subtract, divide.

reiti.net · March 1, 2023, 6:11pm

the += is applied to the public float field, not the struct, so technically that should not result in a “new” struct … am I wrong here?

Also maybe try only benchmarking the addition, as I am not sure if that is the problem.

The division in XNA version replaced two divisions with 1 division + 2 multiplications … is that really faster with the additional register copies involved … I assume it is under many circumstances, but that’s something the Numerics one does just straightforward. Could be the initial “1 / divider” introduces an unneeded cast, as the 1 is assumed as int…
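
For reference, the two division strategies mentioned above can be sketched roughly as follows. This is only an illustration: `DivVec2`, `DivideDirect` and `DivideReciprocal` are hypothetical names, not the MonoGame or System.Numerics source. As far as the C# language rules go, a constant `1` next to a `float` operand is converted to `float` at compile time, so `1 / divider` should not cost a runtime cast compared to `1f / divider`. The two strategies are not guaranteed to round identically in the last bit, though.

```csharp
// Hypothetical stand-in type for illustration only.
public struct DivVec2
{
    public float X;
    public float Y;

    public DivVec2(float x, float y)
    {
        X = x;
        Y = y;
    }

    // Straightforward form: one division per component (two in total).
    public static DivVec2 DivideDirect(DivVec2 v, float divider)
    {
        return new DivVec2(v.X / divider, v.Y / divider);
    }

    // Reciprocal form: a single division, then one multiplication per component.
    public static DivVec2 DivideReciprocal(DivVec2 v, float divider)
    {
        float factor = 1f / divider;
        return new DivVec2(v.X * factor, v.Y * factor);
    }
}
```

Whether the reciprocal form is actually faster would have to be measured per runtime and architecture; the sketch only shows the difference in shape.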

It would be interesting to see what IL is generated for both additions.

BlackGoose · March 1, 2023, 6:54pm

Only running it for addition still yields the same difference in performance.

IL for Microsoft.Xna.Framework.Vector2::op_Addition:

.method public hidebysig specialname static 
        valuetype Microsoft.Xna.Framework.Vector2 
        op_Addition(valuetype Microsoft.Xna.Framework.Vector2 value1,
                    valuetype Microsoft.Xna.Framework.Vector2 value2) cil managed
{
  // Code size       36 (0x24)
  .maxstack  8
  IL_0000:  ldarga.s   value1
  IL_0002:  ldflda     float32 Microsoft.Xna.Framework.Vector2::X
  IL_0007:  dup
  IL_0008:  ldind.r4
  IL_0009:  ldarg.1
  IL_000a:  ldfld      float32 Microsoft.Xna.Framework.Vector2::X
  IL_000f:  add
  IL_0010:  stind.r4
  IL_0011:  ldarga.s   value1
  IL_0013:  ldflda     float32 Microsoft.Xna.Framework.Vector2::Y
  IL_0018:  dup
  IL_0019:  ldind.r4
  IL_001a:  ldarg.1
  IL_001b:  ldfld      float32 Microsoft.Xna.Framework.Vector2::Y
  IL_0020:  add
  IL_0021:  stind.r4
  IL_0022:  ldarg.0
  IL_0023:  ret
} // end of method Vector2::op_Addition

I don’t know how to get the full IL for System.Numerics.Vector2, but here’s the IL for a simple custom struct which uses the same code as in the System.Numerics.Vector2:

.method public hidebysig specialname static 
        valuetype MyVec  op_Addition(valuetype MyVec v1,
                                     valuetype MyVec v2) cil managed
{
  // Code size       32 (0x20)
  .maxstack  8
  IL_0000:  ldarg.0
  IL_0001:  ldfld      float32 MyVec::X
  IL_0006:  ldarg.1
  IL_0007:  ldfld      float32 MyVec::X
  IL_000c:  add
  IL_000d:  ldarg.0
  IL_000e:  ldfld      float32 MyVec::Y
  IL_0013:  ldarg.1
  IL_0014:  ldfld      float32 MyVec::Y
  IL_0019:  add
  IL_001a:  newobj     instance void MyVec::.ctor(float32,
                                                  float32)
  IL_001f:  ret
} // end of method MyVec::op_Addition

reiti.net · March 1, 2023, 7:35pm

that’s very interesting actually - the numerics version doesn’t need to store intermediate values through an address, but rather fills the ctor arguments from the stack directly.

side question: you did this with a Release build and code optimization on, right?

BlackGoose · March 1, 2023, 7:44pm

Yes, Release Build with the optimization checkbox under Project Properties → Build → General activated in VS2022

BlackGoose · March 1, 2023, 8:13pm

Just to be clear, the same performance difference can be observed when comparing multiply vs multiply, subtract vs subtract, etc between Xna.Vector2 and Numerics.Vector2. I merely focused on addition in my explanation as an example.

nkast · March 2, 2023, 12:05am

It’s because when it was written, calling the constructor was slower than reusing value1.
You can see here the generated 80x86 code. net4-x86

There is also a faster alternative: define a 3rd local variable and initialize it locally.

		public static Vector2 operator +(Vector2 value1, Vector2 value2)
		{
			Vector2 result;
			result.X = value1.X + value2.X;
			result.Y = value1.Y + value2.Y;
			return result;
		}

In net4-x64 things got even.
Your custom Struct version is the only one using both xmm0 and xmm1.
I think the other two versions will stall the cpu pipeline on xmm0.
But it is still initializing the new variable to zeros and calling the constructor.

In the latest .net, the version with the 3rd variable and the one with the new constructor are identical.
The code that reuses value1 remained as it was in .net4 x64. My safe bet is to use the code above with the ‘result’ variable. This is also what MonoGame is using elsewhere, usually when the values cannot be reused.
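
A minimal sketch of the three ways of writing the addition discussed in this thread, side by side on a hypothetical `AddVec2` struct. The type and method names are made up for illustration, and plain static methods are used instead of `operator +` so that all three can live in one type. Which one the JIT compiles best depends on the runtime version and architecture.

```csharp
// Hypothetical struct for illustration; not the MonoGame source.
public struct AddVec2
{
    public float X;
    public float Y;

    public AddVec2(float x, float y)
    {
        X = x;
        Y = y;
    }

    // Variant 1: modify the by-value copy of the first parameter and return it.
    public static AddVec2 AddReuse(AddVec2 value1, AddVec2 value2)
    {
        value1.X += value2.X;
        value1.Y += value2.Y;
        return value1;
    }

    // Variant 2: build the result through the constructor.
    public static AddVec2 AddCtor(AddVec2 value1, AddVec2 value2)
    {
        return new AddVec2(value1.X + value2.X, value1.Y + value2.Y);
    }

    // Variant 3: declare a local and assign each field.
    // Assigning every field makes the local definitely assigned without a constructor call.
    public static AddVec2 AddLocal(AddVec2 value1, AddVec2 value2)
    {
        AddVec2 result;
        result.X = value1.X + value2.X;
        result.Y = value1.Y + value2.Y;
        return result;
    }
}
```

Since structs are passed by value, `AddReuse` only changes its own copy of `value1`; the caller's variable is left untouched, so all three variants behave the same from the outside.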

.netCore

nkast · March 2, 2023, 4:29am

I see now that the last link is for Core CLR 7.0.222.60605 on x86.
The code generated from .net 7-x64 is a little worse.
If you have the code handy perhaps you can Benchmark the 3rd option
and share your results.

BlackGoose · March 2, 2023, 5:38am

Thank you for the answer!

The 3rd option yields the same benchmark result for me (.net 7.0 x64), but yeah, the code is different:

.method public hidebysig specialname static 
        valuetype GenericTest.Vec2_3  op_Addition(valuetype GenericTest.Vec2_3 a,
                                                  valuetype GenericTest.Vec2_3 b) cil managed
{
  // Code size       42 (0x2a)
  .maxstack  3
  .locals init (valuetype GenericTest.Vec2_3 V_0)
  IL_0000:  ldloca.s   V_0
  IL_0002:  ldarg.0
  IL_0003:  ldfld      float32 GenericTest.Vec2_3::X
  IL_0008:  ldarg.1
  IL_0009:  ldfld      float32 GenericTest.Vec2_3::X
  IL_000e:  add
  IL_000f:  stfld      float32 GenericTest.Vec2_3::X
  IL_0014:  ldloca.s   V_0
  IL_0016:  ldarg.0
  IL_0017:  ldfld      float32 GenericTest.Vec2_3::Y
  IL_001c:  ldarg.1
  IL_001d:  ldfld      float32 GenericTest.Vec2_3::Y
  IL_0022:  add
  IL_0023:  stfld      float32 GenericTest.Vec2_3::Y
  IL_0028:  ldloc.0
  IL_0029:  ret
} // end of method Vec2_3::op_Addition

nkast · March 3, 2023, 9:01am

I suppose the numerics will come out on top in the case of Vector4 and/or divisions.

It’s good to know that the old trick still holds.
Thanks.