SSE adds eight new processor registers, XMM0 through XMM7. Each XMM register is 128 bits wide, which means it can hold four 32-bit floats. SSE instructions operate on these registers. The image shows the ADDPS instruction, which adds the numbers in the XMM0 and XMM1 registers element by element.
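In C, these instructions are exposed through compiler intrinsics. Here is a minimal sketch (the values are made up purely for illustration) using _mm_add_ps from xmmintrin.h, which compiles down to ADDPS and performs all four additions in a single instruction:

#include <stdio.h>
#include <xmmintrin.h>  /* SSE intrinsics */

int main(void)
{
    /* two XMM-sized values, four packed 32-bit floats each */
    __m128 x = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    __m128 y = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);

    /* ADDPS: the four float pairs are added at once */
    __m128 sum = _mm_add_ps(x, y);

    float out[4];
    _mm_storeu_ps(out, sum);
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);  /* prints 11 22 33 44 */
    return 0;
}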
SSE brings a lot more than just adding two vectors. There are instructions for arithmetic, load/store, logic, comparison, conversion and memory shuffling operations. Here is a nice list of all the instructions for most of the SSE versions.
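As a rough illustration of those categories, here is a small C sketch (the function name is mine; all intrinsics are plain SSE from xmmintrin.h):

#include <xmmintrin.h>

void sse_categories_demo(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);                  /* load: four floats from memory */
    __m128 vb = _mm_loadu_ps(b);
    __m128 prod = _mm_mul_ps(va, vb);             /* arithmetic: MULPS */
    __m128 mask = _mm_cmpgt_ps(va, vb);           /* compare: per-element mask */
    __m128 masked = _mm_and_ps(mask, prod);       /* logic: ANDPS */
    __m128 mixed = _mm_shuffle_ps(va, vb, _MM_SHUFFLE(3, 2, 1, 0)); /* shuffle */
    int first = _mm_cvttss_si32(prod);            /* conversion: lowest float to int */
    (void)mixed;
    (void)first;
    _mm_storeu_ps(out, masked);                   /* store back to memory */
}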
Wow, that's great, but what about .Net? That's the problem: you cannot use SSE instructions directly from any .Net language, because these languages are compiled into MSIL, which is (as everybody knows) platform independent. You might think it is the JIT's job to compile MSIL using SSE instructions. It should be, but unfortunately the JIT doesn't do it. I understand this situation: it is really difficult to automatically find the places in a program that could be computed with SSE, and I believe everybody agrees with me when I say the JIT has to be as fast as possible, so it cannot afford optimizations that are this hard to find.
The solution is to write a native library with SSE-optimized functions and call these functions from your .Net applications.
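A minimal sketch of what the native side of that might look like (the function name, the export macro and the alignment assumptions here are mine, not from the original post); on the C# side such a function would be declared with the DllImport attribute and called through P/Invoke:

#include <xmmintrin.h>

#ifdef _WIN32
#define EXPORT __declspec(dllexport)
#else
#define EXPORT
#endif

/* Adds b to a element-wise, four floats per iteration.
   Assumes count is a multiple of 4 and both buffers are 16-byte aligned. */
EXPORT void AddArraysSse(float *a, const float *b, int count)
{
    int i;
    for (i = 0; i < count; i += 4)
    {
        __m128 va = _mm_load_ps(a + i);
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(a + i, _mm_add_ps(va, vb));
    }
}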
Here is a small example comparing the performance of C# and of C with SSE.
We've got two arrays, a and b, each holding 67108864 floats. We will compute this:
float tmp = a[i] / b[i];
a[i] = a[i] * tmp;
b[i] = b[i] * tmp;
a[i] = a[i] + b[i];
for every i from 0 to 67108863. The C# code looks like the snippet above, and the C code using SSE intrinsics is:
__m128 tmp = _mm_div_ps(ma[i], mb[i]);
ma[i] = _mm_mul_ps(ma[i], tmp);
mb[i] = _mm_mul_ps(mb[i], tmp);
ma[i] = _mm_add_ps(ma[i], mb[i]);
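For context, here is a sketch of how a loop body like that could be wrapped into a complete function; the casts, the count/4 iteration count and the alignment requirement are my assumptions about the surrounding code, which the post does not show:

#include <xmmintrin.h>

/* Reinterprets the float arrays as arrays of __m128 and processes
   four floats per iteration. Assumes count is a multiple of 4 and
   the buffers are 16-byte aligned. */
void ProcessSse(float *a, float *b, int count)
{
    __m128 *ma = (__m128 *)a;
    __m128 *mb = (__m128 *)b;
    int n = count / 4;
    int i;

    for (i = 0; i < n; i++)
    {
        __m128 tmp = _mm_div_ps(ma[i], mb[i]);
        ma[i] = _mm_mul_ps(ma[i], tmp);
        mb[i] = _mm_mul_ps(mb[i], tmp);
        ma[i] = _mm_add_ps(ma[i], mb[i]);
    }
}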
In the C# version we used pointers instead of .Net arrays to make the C# test faster; the speedup comes from skipping the array bounds checks. The results speak for themselves: the C# test takes 1428 ms and the C+SSE test takes 349 ms on my laptop. That is roughly a 75% reduction in runtime, or about a 4x speedup over the C# version.
Actually, SSE instructions can be used from .NET - Mono has support for them: http://www.go-mono.com/docs/monodoc.ashx?link=N%3aMono.Simd
Yes, I read about this possibility, but I'm talking mainly about the Microsoft .Net Framework. I've never worked with Mono, so I don't know much about it. But I'm quite interested in how much freedom Mono gives you when writing SSE-enabled code. Do you know?