Wednesday, March 16, 2011

SSE Instructions

SSE stands for Streaming SIMD Extensions, where SIMD means "Single Instruction, Multiple Data": a single processor instruction performs the same operation on multiple data elements. In the case of SSE, we get an instruction set for processing four single-precision floats at the same time.
SSE brings eight new processor registers, XMM0-XMM7. Each XMM register is 128 bits wide, which means it can hold four 32-bit floats. SSE instructions operate on these registers. The image shows the ADDPS instruction, which adds the numbers in the XMM0 and XMM1 registers element by element.
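To make the ADDPS idea concrete, here is a minimal C sketch using compiler intrinsics from xmmintrin.h. The helper name add4 is made up for this illustration, and the unaligned load/store variants are used only so the sketch works with any float pointer.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_add_ps, ... */

/* add4 is a hypothetical helper name used only for this illustration.
   It adds four floats from x to four floats from y with one packed add. */
void add4(const float *x, const float *y, float *out)
{
    __m128 vx  = _mm_loadu_ps(x);      /* load 4 floats, no alignment required */
    __m128 vy  = _mm_loadu_ps(y);
    __m128 sum = _mm_add_ps(vx, vy);   /* maps to ADDPS: sum[k] = x[k] + y[k] */
    _mm_storeu_ps(out, sum);           /* write the 4 results back to memory */
}
```

Calling add4 with {1, 2, 3, 4} and {10, 20, 30, 40} should leave {11, 22, 33, 44} in out.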

 

SSE offers far more than adding two vectors. There are instructions for arithmetic, load/store, logic, comparison, conversion and shuffle operations. Here is a nice list of all instructions for most SSE versions.
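As a small taste of the non-arithmetic categories, the sketch below shows one comparison and one shuffle in C. The function names less_than_mask and reverse4 are invented for this example; the intrinsics themselves come from xmmintrin.h.

```c
#include <xmmintrin.h>

/* Hypothetical helper: returns a 4-bit mask, bit k is set when x[k] < y[k]. */
int less_than_mask(const float *x, const float *y)
{
    /* CMPLTPS produces all-ones in every lane where the comparison holds */
    __m128 lt = _mm_cmplt_ps(_mm_loadu_ps(x), _mm_loadu_ps(y));
    /* MOVMSKPS collects the top bit of each lane into an integer */
    return _mm_movemask_ps(lt);
}

/* Hypothetical helper: writes the four floats of x to out in reverse order. */
void reverse4(const float *x, float *out)
{
    __m128 v = _mm_loadu_ps(x);
    /* SHUFPS with this selector picks lanes 3, 2, 1, 0 */
    __m128 r = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3));
    _mm_storeu_ps(out, r);
}
```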

Wow, that's great, but what about .NET? That's the problem. You cannot use SSE instructions directly from any .NET language, because these languages are compiled into MSIL, which is (as everybody knows) platform independent. One might think it is the JIT's job to compile MSIL using SSE instructions. It should be, but unfortunately the JIT does not do this. I understand why: I think it is really difficult to automatically find the places in a program that can be computed with SSE. I believe everybody agrees with me when I say "the JIT has to be as fast as possible", so it cannot afford optimizations that are expensive to find.

The solution is to write a native library with SSE-optimized functions and call these functions from your .NET applications.

Here is a small performance comparison of C# and C with SSE.
We've got two arrays, a and b, of 67108864 floats each. We will compute this:
 
float tmp = a[i] - b[i];
a[i] = a[i] * tmp;
b[i] = b[i] * tmp;
a[i] = a[i] + b[i];

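for every i from 0 to 67108863. The C# code looks like the code above. In the C code using SSE, ma and mb are the same arrays viewed as arrays of __m128, so each iteration handles four floats: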
 
__m128 tmp = _mm_sub_ps(ma[i], mb[i]);
ma[i] = _mm_mul_ps(ma[i], tmp);
mb[i] = _mm_mul_ps(mb[i], tmp);
ma[i] = _mm_add_ps(ma[i], mb[i]);
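For completeness, here is a rough sketch of how the whole loop might be wrapped into a function that a native library could export, next to a plain C reference version. The names compute_scalar, compute_sse and SSE_EXPORT are assumptions made for this sketch, not the code used in the test. It follows the scalar description (tmp is the difference a[i] - b[i]) and uses unaligned loads so it does not depend on how the arrays were allocated.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* SSE_EXPORT is a hypothetical macro that makes the function visible
   outside a DLL (Windows) or a shared object (GCC/Clang). */
#if defined(_WIN32)
#define SSE_EXPORT __declspec(dllexport)
#else
#define SSE_EXPORT __attribute__((visibility("default")))
#endif

/* Plain C reference version of the computation. */
void compute_scalar(float *a, float *b, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        float tmp = a[i] - b[i];
        a[i] = a[i] * tmp;
        b[i] = b[i] * tmp;
        a[i] = a[i] + b[i];
    }
}

/* SSE version: four floats per iteration, scalar code for the leftover tail. */
SSE_EXPORT void compute_sse(float *a, float *b, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va  = _mm_loadu_ps(a + i);
        __m128 vb  = _mm_loadu_ps(b + i);
        __m128 tmp = _mm_sub_ps(va, vb);   /* tmp = a - b        */
        va = _mm_mul_ps(va, tmp);          /* a = a * tmp        */
        vb = _mm_mul_ps(vb, tmp);          /* b = b * tmp        */
        va = _mm_add_ps(va, vb);           /* a = a + b (new b)  */
        _mm_storeu_ps(a + i, va);
        _mm_storeu_ps(b + i, vb);
    }
    compute_scalar(a + i, b + i, n - i);   /* fewer than 4 elements remain */
}
```

If the arrays are known to be 16-byte aligned, the aligned variants _mm_load_ps and _mm_store_ps could be used instead. On the .NET side, an exported function like this would typically be reached through P/Invoke.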

In the C# version we used pointers instead of .NET arrays to give the C# test better performance; the gain comes from skipping array bounds checks. The results speak for themselves: the C# test takes 1428 ms and the C+SSE test takes 349 ms on my laptop. That is about 75% less time, or roughly 4 times faster than the C# version.