c++ - SSE / Optimisation - duplicating array into larger array

Thursday, 25 May 2017

c++ - SSE / Optimisation - duplicating array into larger array

I'm trying to optimize the following function:
(Basically it takes a line of 32bit Ints, and duplicates each int into a larger destination arrray, and then duplicates each line

for(int i = 0; i < numLines; i++)
{
    pStartOfLine  = pDest;
    for(int j = 0; j < intsPerLineSrc; j++)
    {
        *pDest = *pSrc;                     // copy pixel A to FullSizeBuffer A
        pDest++;                            // Move dest Ptr to next Pixel
        *pDest = *pSrc;                     // Copy pixel A to FullsizeBuffer AGAIN  

        pDest++;                            // Move Src and Dst Pointrs to next pixels

       pSrc++;
    }

   memcpy(pDest, pStartOfLine, (8*intsPerLineSrc) );            // Duplicate the Line written to pDest, to next line of pDest.
   pDest = pDest + (2*intsPerLineSrc);                          // move pDest to Start of Next Line
}

Effectively scaling an image up to 2 * it's original size in both dimensions.
Now this strikes me as something that should benefit massively from SIMD, however i cannot seem to find the right set of intrinsic instructions that will assist me in this specific case.

Anyone care to help me out?
Or will i always be memory limited in such a simple operation that re-factoring in SIMD is a waste?

Yes this section of code ends up running in multiple threads, so it is already heavily multi-threaded, but i think that SIMD optimization may be even more helpful.

Cheers, for any help / advice,

James

Answer

Your current operation is memory bandwidth bound.

If you can find a way to not process the entire image but instead process blocks (e.g. 16x16 pixel blocks to 32x32 pixel blocks) and do other computations each block then you may be able to make your operations less memory bandwidth bound.

But if you have to process the entire image there are a few things you should consider to achieve the maximum memory bandwidth:

For memory bandwidth bound operations they don't scale with the number of cores but they do scale with the number of sockets. So if you have a dual socket system the memory bandwidth is twice a single socket system (assumption both sockets use the same processor which they usually do). However, achieving twice the bandwidth can be tricky.

The memcpy function is typically not optimized for copying large sizes. One of the main reasons is that many implementations of it don't use non-temporal stores. The rule of thumb for non-temporal stores is to use them when the size is larger than twice the size of slowest cache. Let's assume your processor has a 12 MB L3 cache. Then if the size of the destination image is larger than 6MB you should consider using non-temporal stores. This is almost certainly your case since you're code is writing 32 MB.

Here is an example of how you can use both SSE2 and non-temporal stores

int main() {
    int n = 16;
    int *src = (int*)_mm_malloc(n*sizeof(int),   16); //16 byte aligned 
    int *dst = (int*)_mm_malloc(2*n*sizeof(int), 16); //16 byte aligned
    for(int i=0; i    for(int i=0; i        __m128i  x = _mm_load_si128((__m128i*)&src[i]);

        __m128i lo = _mm_shuffle_epi32(x, 0x50); // 0x50 = 1100 in base 4  
        __m128i hi = _mm_shuffle_epi32(x, 0xfa); // 0xfa = 3322 in base 4
        _mm_stream_si128((__m128i*)&dst[2*i+0], lo); //non-temporal store
        _mm_stream_si128((__m128i*)&dst[2*i+4], hi); //non-temporal store
        //_mm_store_si128((__m128i*)&dst[2*i+0], lo);
        //_mm_store_si128((__m128i*)&dst[2*i+4], hi);
    }
    //for(int i=0; i    //for(int i=0; i<(2*n); i++) printf("%x ", dst[i]); printf("\n");
}

In your case replace n by the number of pixels. If n is not a multiple of four then you have to do a little clean up which I did not do here. The temporal stores have to be 16 byte aligned in order to do this which is why I aligned dst. However, src does not have to be 16 byte aligned so you could use _mm_loadu_si128 and not align src.

Once you achieve maximum bandwidth for a single thread and assuming you have a multi-socket system you should try and achieve maximum bandwidth from both sockets. I don't have enough experience with this to help but I think it can be achieved using numactl. See why-doesnt-this-code-scale-linearly for an example.

Blog

Thursday, 25 May 2017

c++ - SSE / Optimisation - duplicating array into larger array

No comments:

Post a Comment

c++ - Does curly brackets matter for empty constructor?