Sunday, March 20, 2011

Speeding up sigmoid function by approximating exponential function

I will talk about sigmoid function defined by formula
where x is real and k is positive constant. Its graph is for k=4 shown bellow.

This function can be used for example as activation function in neural networks. When we use it as activation function in neural network a code evaluating it will be very stressed. So our approach should be to optimize it.
For simplicity we will assume k=4 for rest of this post. Classical implementation in C# may look like this simple code:
private static float Sigmoid(float signal)
    return 1f / (1f + (float)Math.Exp(-4f * signal));
Unfortunately there're not any single precision Math class in .Net. If we need only single precision values it's inefficient to calculate all values in double precision. The first idea how to solve this problem could be usage of C language where are single precision mathematical functions. But how it was shown in last post about calling unmanaged code from .Net it wouldn't be the best solution due to high frequency of calling unmanaged code.
The slowest part of this small code is of course evaluation of exponential function. In lot of situations  we do not need so precise results and we can use some kind of approximation. This is also a case of neural networks mentioned before. We will base our approximation on well known formula
We can take as an approximation of e^x formula (1+x/n)^n for some finite n. The higher n we select the closer approximation we will get. On the graph below we can see sigmoid function with approximated exponential for n=16.

There's again one small problem which we can see on next image.

As we can see the problem with approximated sigmoid function is that its limit for x goes to infinity is 0 but limit of sigmoid function for x goes to infinity is 1. We can see the problem is solvable by one simple if-statement. We find out that local maximum of approximated sigmoid function is at x=4.0. Now we just say that approximated function for x>4.0 will be 1. The code for approximated sigmoid function can look like this:
private static float Sigmoid(float signal)
    if (signal >= 4f) return 1f;
    float tmp = 1f - 0.25f * signal;
    tmp *= tmp;
    tmp *= tmp;
    tmp *= tmp;
    tmp *= tmp;
    return 1f / (1f + tmp);
This code looks more complex than original version but it's approximately 78% faster. I was testing the speed of both methods with numbers from interval (-0.5, 0.5). If we try some interval which contains numbers bigger than our boundary value 4.0 it should be even faster.

Thursday, March 17, 2011

Techniques of calling unmanaged code from .NET and their speed

Default technology for calling unmanaged code (for example C++ code) from .NET is Platform Invoke. It is available from every managed language which supports method attributes through DllImportAttribute. Usage of this attribute is shown below in C#:
static extern void TestCpp(float a, float b, float* result);
Platform Invoke is complex technology which ensures all necessary conversions of data structures from managed to unmanaged world and vice versa. You can use DllImportAttribute's named parameters or other attributes like InAttribute or OutAttribute. These options provide you possibility to more precisely control the invocation of native methods like custom marshaling or specifying calling conventions. Here is very good article about using PInvoke.
If you try to test PInvoke performance you will probably figure out that it's not so good. When we often call unmanaged code from our application we should think about amount of context switching from managed to unmanaged code because it is performance bottleneck. That's mainly due to security checks which .Net runtime performs before each call of unmanaged function. It goes through call stack and checks if every caller has appropriate rights. We can suppress this kind of behaviour by using SuppressUnmanagedCodeSecurity attribute. This can really help us improve performance when we need to do a lot of context switching.

Next technique is usage of GetDelegateForFunctionPointer method defined in Marshal class. In our unmanaged library we simply provide pointer to function we want to call and in our managed application we will create a delegate for this pointer by GetDelegateForFunctionPointer method. There is a same performance problem with call stack security checks in this solution. We can suppress this by use of SuppressUnmanagedCodeSecurity attribute on definition of delegate type we will use for our unmanaged function. In the end we will see this method of calling unmanaged code is not so effective as PInvoke. If we need to setup calling convention of this we just need to use UnmanagedFunctionPointer attribute. The example below shows usage of SuppressUnmanagedCodeSecurity and UnmanagedFunctionPointer attributes:
delegate void MyUnmanagedDelegate(float a, float b, float* result);
Last but not least technique is to use C++/CLI to create layer between managed and unmanaged code. We just need to write managed type which will provide our unmanaged functionality. That's quite easy because C++/CLI enables us to mix managed and unmanaged code. Simple C++/CLI code providing this functionality can look like this:
#pragma unmanaged
void unmanagedFunction(float a, float b, float* result)
 float tmp = a - b;
 *result = a * tmp + b * tmp;
#pragma managed
using namespace System;
namespace cliLib 
 public ref class cliClass
  static void CallUnmanagedFunction(float a, float b, float* result)
   unmanagedFunction(a, b, result);
The last method I will write about is kinda special. I discovered it a few weeks ago when I was working on library for speeding up math operations in XNA. It is based on magic calli instruction of MSIL language. Calli instruction stands for indirect method call. And what is the magic there? Calli instruction takes as main argument pointer to machine code which will start executing. That means we can give it pointer to unamanaged function. There's just one problem. When we try it, it doesn't work. It is due to calling convention which CLR uses. CLR uses fastcall calling convention what means arguments of function are passed into registers when possible. But it has simple solution. We just need to specify fastcall calling convention for our unmanaged function. After that we can write small MSIL library or use Reflection Emit to call our unmanaged function.

Here is a little performance test of mentioned techniques:
C# (informative)4318 ms
PInvoke - suppressed security5415 ms
Calli instruction5505 ms
C++/CLI6311 ms
Function delegate - suppressed security7788 ms
PInvoke8249 ms
Function delegate11594ms

I was testing simple code with float arithmetic in 67108864 iterations.

Wednesday, March 16, 2011

SSE Instructions

SSE acronym stands for Streaming SIMD Extensions, where SIMD is "Single Instruction, Multiple Data". It means we can use only one processor instruction to perform the same operation on multiple data. In SSE case we've got instruction set for processing four single precision float numbers at the same time.
SSE brings eight  new processor registers XMM0-XMM7. Each XMM register has 128bit size what means we can store four 32bit floats into that. SSE instructions operate on these registers. On the image you can see usage of instruction ADDPS which adds numbers from XMM0 and XMM1 register one by one.


 SSE brings a lot more instructions than adding two vectors. There are instructions for arithmetic, store/load, logic, compare, conversion and memory shuffling operations. Here is nice list of all instructions for  most of SSE versions.

Wow, that's great but what about .Net? That's problem. You cannot use SSE instructions directly from any .Net language because these languages are compiled into MSIL which is (as everybody knows) platform independent. Someone may think it is JIT's job to compile MSIL with usage of SSE instructions. It should be but unfortunately JIT does not use it. I understand this situation. I think it's really difficult to automatically find possible places in program which can be computed with SSE. I believe everybody agrees with me when i say "JIT has to be as fast as possible". Therefore optimizations there cannot be so difficult to find.

The solution is to write native library with SSE optimized functions and call these functions from your .Net applications.

Here is a small example of C# and C with SSE used performance.
We've got two arrays a and b of 67108864 floats. We will compute this:
float tmp = a[i] - b[i];
a[i] = a[i] * tmp;
b[i] = b[i] * tmp;
a[i] = a[i] + b[i];

for every i from 0 to 67108863. C# code looks like code above and C code with SSE used is:
__m128 tmp = _mm_div_ps(ma[i], mb[i]);
ma[i] = _mm_mul_ps(ma[i], tmp);
mb[i] = _mm_mul_ps(mb[i], tmp);
ma[i] = _mm_add_ps(ma[i], mb[i]);

In C# version we used pointers instead of .Net arrays for better performance of C# test. This performance boost is hidden in omitting array boundaries checks. This speed test results speaks for itself. C# test takes 1428 ms and C+SSE test takes 349 ms on my laptop. It's approximately about 75% more efficient than C# version.